
It might be below the fold, but it looks like they're missing the most important p-hacking strategy of all: the dogshit null hypothesis. It's very reliable and it's the most common type of p-hacking that I see.

It's easy to create a dogshit null hypothesis by negligence or by "negligence" and it's easy to reject a dogshit null hypothesis by simply collecting enough data as it automatically crumbles on contact with the real world -- that's what makes it dogshit. One might hope that this would be caught by peer review (insist on controls!) but I see enough dogshit null hypotheses roaming around the literature that these hopes are about as realistic as fairy dust. In practice, the dogshit null hypothesis reigns supreme, or more precisely it quietly scoots out of the way so that its partner in crime, the dogshit alternative hypothesis, can have an unwarranted moment in the spotlight.



This would be much better with an example


If I understand the parent commenter, here's a common example from population-level statistics like public health:

"State X saw a mortality rate last year that was statistically significantly higher than the national rate. We should focus our intervention there."

The null hypothesis is that the risks of death are exactly the same in the state vs the nation. That may work with experimental sample sizes, but at the population level you'll often have massive sample sizes. A statistically significant difference is not interesting by itself. It's just the first hurdle to jump before even discussing the importance of the difference. But I've seen publications (especially data reports with sprinklings of discussion) focus entirely on statistically significant differences in the narrative next to tables.

This isn't P-hacking an experiment, but it is abusing and misunderstanding statistical significance to make decisions.
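
For a concrete illustration of the sample-size point, here's a rough Python sketch with completely made-up numbers (not real mortality data): at population scale, a gap of roughly a hundredth of a percentage point in a rate is enough to reject the "exactly equal" null, so rejection by itself tells you nothing about whether the gap is worth acting on.

    # Two-proportion z-test against the null "state and national rates are exactly equal".
    # All counts below are invented for illustration.
    import math
    from scipy.stats import norm

    state_deaths, state_pop = 42_600, 5_000_000        # made-up rate: 0.852%
    natl_deaths,  natl_pop  = 2_520_000, 300_000_000   # made-up rate: 0.840%

    p1 = state_deaths / state_pop
    p2 = natl_deaths / natl_pop

    p_pool = (state_deaths + natl_deaths) / (state_pop + natl_pop)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / state_pop + 1 / natl_pop))
    z = (p1 - p2) / se
    p_value = 2 * norm.sf(abs(z))

    print(f"difference = {p1 - p2:.3%}, z = {z:.2f}, p = {p_value:.3g}")
    # With these numbers: difference ~ 0.012 percentage points, z ~ 2.9, p ~ 0.004.
    # Statistically significant, but significance alone says nothing about whether
    # the difference is big enough to redirect an intervention.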


"I ran a t-test on the untreated / treated samples and the difference is significant! The treatment worked!"

...but the data table shows a clear trend over time across both groups because the samples were being irradiated by intense sunlight from a nearby window. The null hypothesis didn't account for this possibility, so it was rejected, but not because the treatment worked.

That's a relatively trivial example and you can already imagine ways in which it could have occurred innocently and not-so-innocently. Most of the time it isn't so straightforward. The #1 culprit I see is failure to account for some kind of obvious correlation, but the ways in which a null hypothesis can be dogshit are as numerous and subtle as the number of possible statistical modeling mistakes in the universe because they are the same thing.
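
Here's a toy simulation of that failure mode (everything invented): both groups drift upward over time because of the uncontrolled "sunlight" factor, the untreated samples happen to be measured first, and a plain t-test happily rejects its null even though the treatment does nothing.

    # Toy example: no treatment effect at all, just a time trend plus the
    # accident that the two groups were measured at different times.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)

    hours = np.arange(48)                  # two days of hourly measurements
    drift = 0.05 * hours                   # steady drift from the sunny window
    values = 10 + drift + rng.normal(0, 0.5, size=48)

    untreated = values[:24]                # measured on day 1
    treated   = values[24:]                # measured on day 2

    t, p = stats.ttest_ind(untreated, treated)
    print(f"t = {t:.2f}, p = {p:.2g}")     # p comes out tiny
    # The null ("independent samples, same mean") never modeled the drift,
    # so rejecting it says nothing about the treatment.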


I think you're observing an issue with experimental designs not actually challenging the null hypothesis, rather than with poor null hypotheses themselves. In other words, papers creating experiments that don't actually challenge the hypothesis. There was a major example of this with COVID. A typical way observational studies assessed the efficacy of the vaccines was by comparing overall outcomes between normalized samples of unvaccinated and vaccinated individuals who came to the hospital. Unvaccinated individuals generally had worse outcomes, so therefore the vaccines must be effective.

This logic was used repeatedly, but it fails to account for numerous obvious biases. For instance unvaccinated people are generally going to be less proactive in seeking medical treatment, and so the average severity of a case that causes them to go to the hospital is going to be substantially greater than for a vaccinated individual, with an expectation of correspondingly worse overall outcomes. It's not like this is some big secret - most papers mentioned this issue (among many others) in the discussion, but ultimately made no effort to control for it.
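
To be clear about the mechanism (and only the mechanism), here's a small simulation where vaccination has zero effect on severity by construction, and the only difference between groups is how sick someone has to be before going to the hospital; the hospitalized unvaccinated group still looks worse. All parameters are invented, so this says nothing about actual efficacy either way.

    # Selection-bias sketch: identical severity distributions, different
    # care-seeking thresholds, comparison restricted to hospital patients.
    import numpy as np

    rng = np.random.default_rng(1)
    n = 200_000

    vaccinated = rng.random(n) < 0.7                     # made-up coverage
    severity = rng.gamma(shape=2.0, scale=1.0, size=n)   # same for both groups

    # Assumption from the comment above: unvaccinated wait until they're sicker
    threshold = np.where(vaccinated, 2.5, 3.5)
    in_hospital = severity > threshold

    print("hospitalized vaccinated, mean severity:  ",
          round(severity[in_hospital & vaccinated].mean(), 2))
    print("hospitalized unvaccinated, mean severity:",
          round(severity[in_hospital & ~vaccinated].mean(), 2))
    # The unvaccinated in-hospital mean comes out higher even though the
    # underlying severity distributions are identical by construction.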


I'm not great at stats so I don't understand this example. Wouldn't the sunlight affect both groups equally? How can an equal exposure to sunlight create a significant difference?


> looks like they're missing the most important p-hacking strategy of all: the dogshit null hypothesis

Would you mind giving an example or two of such, and explaining how it differs from a "good" null hypothesis?


Null hypotheses are often idealized distributions that are mathematically convenient but over-simplify the distributions we'd expect if there were truly no effect (because the expected distributions are either intractable to work with, or irregular and unknown).

So for example, suppose you want to detect whether there are unusual patterns in website traffic -- a bot attack or unexpected popularity spike. You look at page views per hour over several days, with the null hypothesis that page views are normally distributed, with constant mean and variance over time.

You run a test, and unsurprisingly, you get a really low p-value, because web traffic has natural fluctuations, it's heavier during the day, it might be heavier on weekends, etc.

The test isn't wrong -- it's telling you that this data is definitely not normally distributed with constant mean and variance. But it's also not meaningful because it's not actually answering the question you're asking.
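
Rough sketch of that example with synthetic data (all numbers invented): ordinary traffic with a daily cycle, tested against the convenient "i.i.d. normal, constant mean and variance" null. The test rejects with a tiny p-value even though nothing unusual is happening.

    # Synthetic "ordinary" traffic: hourly page views with a daily cycle.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)

    hours = np.arange(24 * 7)                                 # one week, hourly
    daily_cycle = 200 + 150 * np.sin(2 * np.pi * hours / 24)  # busier by day
    views = rng.poisson(daily_cycle)                          # no attack, no spike

    # A normality test stands in for the "constant mean/variance normal" null
    stat, p = stats.normaltest(views)
    print(f"p = {p:.2g}")   # tiny, because the null was never a realistic
                            # model of "nothing unusual is happening"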



