One sure way to kill a joke is to explain it. I hate to kill this great and clever joke on statistical significance, but here goes. Maybe you want to just read the joke, love it, treasure it and move on without reading the rest of the article.
First the good things.
The “scientists” start with hypotheses external to the data source, collect data and test for statistical significance. They likely used a one-tailed t-test and ran a between-groups experiment.
One group was the control group that did not eat the jelly beans. The other group was the treatment group that was treated with jelly beans.
The null hypothesis H0 is, “Any observed difference between the number of occurrences of acne between the two groups is just due to coincidence”.
The alternative hypothesis H1 is, “The differences are statistically significant. The jelly beans made a difference”.
They use a significance threshold of 0.05 (95% confidence). A p-value is the probability of seeing a difference at least as large as the one observed if H0 were true, so p = 0.05 means a result this extreme would turn up only 5% of the time by chance alone. If the computed p-value is less than 0.05 (p<0.05), they reject H0 in favor of H1. If the computed p-value is greater than 0.05 (p>0.05), H0 cannot be rejected: the data are consistent with pure chance.
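To make the decision rule concrete, here is a minimal sketch in Python. It swaps in a one-tailed two-proportion z-test for the t-test guessed at above, since "acne or no acne" is a yes/no outcome. The function name and the counts are invented for illustration; they are not from the comic.

```python
import math

def one_tailed_p_value(acne_treat, n_treat, acne_ctrl, n_ctrl):
    """One-tailed two-proportion z-test: is the treatment acne rate higher?"""
    p_treat = acne_treat / n_treat
    p_ctrl = acne_ctrl / n_ctrl
    # Pooled rate under H0 (both groups share one underlying acne rate).
    pooled = (acne_treat + acne_ctrl) / (n_treat + n_ctrl)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_treat + 1 / n_ctrl))
    z = (p_treat - p_ctrl) / std_err
    # Upper-tail probability of a standard normal: P(Z >= z).
    return 0.5 * math.erfc(z / math.sqrt(2))

ALPHA = 0.05  # significance threshold used in the post

# Made-up counts, for illustration only.
p = one_tailed_p_value(acne_treat=30, n_treat=100, acne_ctrl=20, n_ctrl=100)
print(f"p = {p:.3f}:", "reject H0" if p < ALPHA else "cannot reject H0")
```

Even a 30% versus 20% acne rate lands right around the threshold with 100 people per group, which shows how much the verdict depends on where the line is drawn.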
They run a total of 21 experiments.
The first is the aggregate. They likely used a large jar of mixed-color jelly beans, ran the test, and found no compelling evidence to overthrow the null hypothesis that it was just coincidence.
Then they run 20 more experiments, one for each color. They find that in 19 of the experiments (with 19 different colors) they cannot rule out coincidence. But in one experiment using green jelly beans they find p less than 0.05. They reject H0 and accept H1 that green jelly beans made a difference in the number of occurrences of acne.
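In 20 out of 21 experiments (95.24%), the results were not significant enough to toss out coincidence as the reason. In 1 out of 21 experiments (4.76%) they were, and hence green was linked to acne.
In other words, roughly 1 test in 20 came up significant, which is just what a 5% threshold produces when there is no real effect at all. With 20 independent color tests each run at p<0.05, the chance that at least one crosses the line by coincidence alone is 1 − 0.95^20, or about 64%.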
However the media reports, “Green jelly beans linked to acne at 95% confidence level”, because that experiment found p<0.05.
Green color is the spurious variable. The fact that the green experiment had p<0.05 could easily be because this experiment run happened to have a high concentration of random flukes in it.
That is exactly what significance testing on random samples allows for: at a 5% threshold, about 1 in 20 tests of a true null hypothesis will look significant by chance.
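A quick simulation shows how often 20 color tests produce at least one fluke when jelly beans do nothing at all. This is a sketch under two assumptions: the 20 tests are independent, and under a true null hypothesis each p-value is uniformly distributed between 0 and 1 (which holds for continuous test statistics).

```python
import random

ALPHA = 0.05      # per-test significance threshold
NUM_COLORS = 20   # one test per jelly bean color

# Analytic chance that at least one of 20 independent true-null tests
# comes out "significant".
analytic = 1 - (1 - ALPHA) ** NUM_COLORS

def any_false_positive(rng):
    # Under a true null hypothesis a p-value is uniform on [0, 1],
    # so each test falls below ALPHA with probability ALPHA.
    return any(rng.random() < ALPHA for _ in range(NUM_COLORS))

rng = random.Random(0)  # fixed seed so the run is repeatable
trials = 100_000
simulated = sum(any_false_positive(rng) for _ in range(trials)) / trials

print(f"analytic:  {analytic:.3f}")
print(f"simulated: {simulated:.3f}")
```

Both numbers come out well above one half, so a "green jelly bean" somewhere in the batch is the expected outcome, not a surprise.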
If we had not seen the first experiment or the 19 other experiments that had p>0.05, we would be tempted to accept the link between green jelly beans and acne. Since we saw all the negative results, we know better.
In reality, we don’t see most if not all of the negative findings. Only the positive results get written up – be it the results of an A/B test that magically increased conversion or scientific research.
After all, it is not interesting to read how changing just one word did not have an effect on conversion rates. Such negative findings deserve their place in the round filing cabinet.
By rejecting all negative findings and choosing only positive findings, the experimenters violate the rule of random sampling and highlight a high concentration of random flukes as a breakthrough.
The next step in this slippery slope of pseudo-statistical testing is Data Dredging. Here one skips the need for initial hypotheses and simply dives into the data to find “interesting correlations”.
Data Dredging is slicing up data in every possible dimension to find something – anything.
For example, “Eating Green jelly beans with left hand while standing up on Tuesdays” causes acne.
If you find this claim so ridiculous that you would not fall for it, consider all the articles you have seen about the best days to run email marketing, the best days to tweet, or how to do marketing like certain brands.
Can you spot the fact that these are based on Data Dredging?
(See here for a great article on Data Dredging).
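Data Dredging is easy to reproduce on pure noise. The sketch below invents a data set in which acne strikes 20% of people no matter what, then slices it by color, hand, posture and weekday and tests every slice against everyone else. All names, sizes and rates are made up for illustration, and the test is a one-tailed two-proportion z-test chosen for simplicity.

```python
import math
import random

ALPHA = 0.05
COLORS = range(20)
HANDS = ("left", "right")
POSTURES = ("standing", "sitting")
WEEKDAYS = range(7)

def upper_tail_p(hits_a, n_a, hits_b, n_b):
    """One-tailed two-proportion z-test: is group A's rate higher than B's?"""
    pooled = (hits_a + hits_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (hits_a / n_a - hits_b / n_b) / std_err
    return 0.5 * math.erfc(z / math.sqrt(2))

rng = random.Random(1)  # fixed seed so the run is repeatable
# Pure noise: acne strikes 20% of people regardless of any attribute.
counts = {}  # slice -> [people, acne cases]
for _ in range(56_000):
    key = (rng.choice(COLORS), rng.choice(HANDS),
           rng.choice(POSTURES), rng.choice(WEEKDAYS))
    cell = counts.setdefault(key, [0, 0])
    cell[0] += 1
    cell[1] += rng.random() < 0.20

total_n = sum(n for n, _ in counts.values())
total_acne = sum(a for _, a in counts.values())

# Test every slice against everyone else, with no prior hypothesis.
flukes = [key for key, (n, acne) in counts.items()
          if upper_tail_p(acne, n, total_acne - acne, total_n - n) < ALPHA]

print(f"{len(counts)} slices tested, {len(flukes)} 'significant' at p < {ALPHA}")
```

With several hundred slices, a few dozen "findings" fall out of data that contains no effect whatsoever. Any one of them could be written up as a headline.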
In this age of instant publication, easy experimentation, Big Data and social media echo chambers, how can you spot and stay clear of Random Flukes reported as scientific research?
You can start with this framework:
- What are the initial hypotheses before collecting data? If there are none, thanks but no thanks. (Data Dredging)
- How are these hypotheses arrived at? If these were derived from the very data they are tested with, keep moving. A great example of this is the class of books, “7 Habits of …”
- See a study from extremely large samples? Differences too small to register in small samples, and too small to matter, become statistically significant once the sample is large enough. That is just a mathematical artifact of the test. Again, thanks but no thanks.
- Very narrow finding? It is the green jelly bean again; ask about the other dimensions.
Or you can just plain ignore all these nonsensical findings camouflaged in analytics.
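For the large-sample item in the framework above, here is a small sketch of the artifact. It runs the same trivial difference (a 20.5% versus 20.0% acne rate, numbers invented for illustration) through a one-tailed two-proportion z-test at two sample sizes.

```python
import math

def upper_tail_p(hits_a, n_a, hits_b, n_b):
    """One-tailed two-proportion z-test: is group A's rate higher than B's?"""
    pooled = (hits_a + hits_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (hits_a / n_a - hits_b / n_b) / std_err
    return 0.5 * math.erfc(z / math.sqrt(2))

ALPHA = 0.05
p_by_n = {}
for n in (1_000, 1_000_000):  # people per group
    acne_treat = round(0.205 * n)  # 20.5% in the treatment group
    acne_ctrl = round(0.200 * n)   # 20.0% in the control group
    p_by_n[n] = upper_tail_p(acne_treat, n, acne_ctrl, n)
    verdict = "significant" if p_by_n[n] < ALPHA else "not significant"
    print(f"n = {n:>9,} per group: p = {p_by_n[n]:.2g} ({verdict})")
```

The half-point difference is the same in both runs. Only the sample size changed, and with it the verdict.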