Testing 40 shades of blue – AB Testing

The title refers to the famous anecdote about Marissa Mayer testing 40 shades of blue to determine the right color for links. (Unfortunately, I am colorblind; I know just one blue.)

Mayer is famous for many things at Google, but the one that always sticks out – and defines her in some ways – is the “Forty Shades of Blue” episode.

She ordered that 40 different shades of blue be randomly shown to 2.5% of visitors each; Google would note which colour earned more clicks. And that was how the blue colour you see in Google Mail and on the Google page was chosen.

Thousands of such tests happen in the web world, with every website running multiple experiments a day. Contrary to what many in webapp development may believe, A/B testing does not have its origins in the webapp world. It is simply an application of statistical testing, the Randomized Controlled Trial, to determine whether a ‘treatment’ made a difference in the performance of the treatment group compared to that of the control group.

The simplest test is checking whether the observed difference between two sample means is statistically significant. What that means is measuring the probability, the p-value, of seeing a difference at least this large if the treatment had no effect at all. If the p-value is less than a preset level, we declare that the treatment made a difference.
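
To make that concrete, here is a minimal sketch in Python. The data, the sample sizes and the choice of a Welch two-sample t-test are my assumptions for illustration, not anything from the anecdote: simulate a per-visitor metric for two page variants and ask whether the difference in means could plausibly be chance.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical per-visitor metric (e.g., session value) for two page variants
control = rng.normal(loc=10.0, scale=3.0, size=5000)    # variant A
treatment = rng.normal(loc=10.1, scale=3.0, size=5000)  # variant B

# Welch two-sample t-test: could the observed difference in means be chance?
result = stats.ttest_ind(treatment, control, equal_var=False)

alpha = 0.05  # the preset significance level
print(f"difference in means: {treatment.mean() - control.mean():.3f}")
print(f"p-value: {result.pvalue:.4f}")
print("significant" if result.pvalue < alpha else "not significant (could just be chance)")
```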

Does it matter if the results are statistically significant? See here why it does not:

“I have published about 800 papers in peer-reviewed journals and every single one of them stands and falls with the p-value. And now here I find a p-value of 0.0001, and this is, to my way of thinking, a completely nonsensical relation.”

Should you test 40 shades of blue to find the one that produces most click-thrus or conversions? xkcd has the answer:

Can Ms. Mayer test her way out of Yahoo’s current condition? Remember, all this split testing is about finding low-hanging fruit, not quantum leaps. And as Jim Manzi wrote in his book Uncontrolled,

Perhaps the single most important lesson I learned in commercial experimentation, and that I have since seen reinforced in one social science discipline after another, is that there is no magic. I mean this in a couple of senses. First, we are unlikely to discover some social intervention that is the moral equivalent of polio vaccine. There are probably very few such silver bullets out there to be found. And second, experimental science in these fields creates only marginal improvements. A failing company with a poor strategy cannot blindly experiment its way to success …

You can’t make up for poor strategy with incessant experimentation.

Significance of Random Flukes

One sure way to kill a joke is to explain it. I hate to kill this great and clever joke on statistical significance, but here it goes. Maybe you want to just read the joke, love it, treasure it and move on without reading the rest of the article.

Love it? I love this one for its simple elegance. You can leave now if you do not want to see this dissected.

First, the good things.

The “scientists” start with hypotheses external to the data source, collect data and test for statistical significance. They likely used a one-tailed t-test and ran a between-groups experiment.

One group was the control group that did not eat the jelly beans. The other group was the treatment group that was treated with jelly beans.

The null hypothesis H0 is, “Any observed difference in the number of occurrences of acne between the two groups is just due to coincidence”.

The alternative hypothesis H1 is, “The difference is not due to chance. The jelly beans made a difference”.

They use a p-value threshold of 0.05 (95% confidence level). A p-value of 0.05 means there is only a 5% probability of seeing a result this extreme purely by chance. If the computed p-value is less than 0.05 (p<0.05), they reject H0 and accept H1. If the computed p-value is greater than 0.05 (p>0.05), H0 cannot be rejected: chance cannot be ruled out as the explanation.
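
Here is a hedged sketch of that decision rule with made-up counts. The group sizes, the acne rates and the choice of Fisher’s exact test are all my assumptions for illustration; the comic never says what test was run.

```python
from scipy.stats import fisher_exact

# 2x2 table: rows = groups, columns = (acne, no acne); all counts are invented
jelly_group = [18, 82]     # hypothetical treatment group: 18 of 100 got acne
control_group = [12, 88]   # hypothetical control group:   12 of 100 got acne

odds_ratio, p_value = fisher_exact([jelly_group, control_group])

alpha = 0.05
print(f"p-value: {p_value:.3f}")
if p_value < alpha:
    print("Reject H0: the jelly beans appear to make a difference.")
else:
    print("Cannot reject H0: the observed difference may well be coincidence.")
```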

They run a total of 21 experiments.

The first is the aggregate. They likely used a large jar of mixed-color jelly beans, ran the test and found no compelling evidence to overturn the null hypothesis that any difference was just coincidence.

Then they run 20 more experiments, one for each color. They find that in 19 of the experiments (with 19 different colors) they cannot rule out coincidence. But in one experiment, using green jelly beans, they find p less than 0.05. They reject H0 and accept H1: green jelly beans made a difference in the number of occurrences of acne.

In 20 out of 21 experiments (about 95%), the results were not significant enough to toss out coincidence as the reason. In 1 out of 21 (about 5%) they were, and hence green was linked to acne.

In other words, one “significant” result out of 21 tests run at the 5% level is just about what you would expect by chance alone, even if jelly beans had nothing to do with acne.

However the media reports, “Green jelly beans linked to acne at 95% confidence level”, because that experiment found p<0.05.

Green is the spurious variable. The fact that the green experiment had p<0.05 could easily be because that particular run happened to have a high concentration of random flukes in it.

That is the very nature of statistical significance testing on random samples: run enough tests at the 5% level and about one in twenty will cross the threshold by chance alone.
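
A quick simulation makes the point. This sketch (the group sizes, number of repetitions and the use of a t-test are illustrative assumptions) reruns the 20 color experiments many times on data where the jelly beans have no effect at all, and counts how often at least one color still comes out “significant”. The answer is roughly 1 − 0.95^20, about 64% of the time.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_runs, n_colors, n_per_group, alpha = 1000, 20, 100, 0.05

runs_with_false_alarm = 0
for _ in range(n_runs):
    p_values = []
    for _ in range(n_colors):
        # both groups drawn from the SAME distribution: the beans do nothing
        control = rng.normal(size=n_per_group)
        treatment = rng.normal(size=n_per_group)
        p_values.append(stats.ttest_ind(treatment, control).pvalue)
    if min(p_values) < alpha:
        runs_with_false_alarm += 1

print(f"runs with at least one 'significant' color: {runs_with_false_alarm / n_runs:.0%}")
print(f"theory: 1 - 0.95**20 = {1 - 0.95**20:.0%}")  # roughly 64%
```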

If we had not seen the first experiment or the 19 other experiments that had p>0.05, we would be tempted to accept the link between green jelly beans and acne. Since we saw all the negative results, we know better.

In reality, we don’t see most, if not all, of the negative findings. Only the positive results get written up – be it the results of an A/B test that magically increased conversion or scientific research. After all, it is not interesting to read how changing just one word did not have an effect on conversion rates. Such negative findings deserve their place in the round filing cabinet.

By discarding all negative findings and choosing only positive findings, the experimenters violate the rule of random sampling and highlight a high concentration of random flukes as a breakthrough.

The next step down this slippery slope of pseudo-statistical testing is Data Dredging. Here one skips the need for initial hypotheses and simply launches into the data looking for “interesting correlations”. Data Dredging is slicing up data along every possible dimension to find something – anything.

For example, “Eating Green jelly beans with left hand while standing up on Tuesdays” causes acne.
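
Here is what that kind of dredging looks like in a sketch. Every number in it (20 colors, two hands, seven days, a 15% base acne rate, 20,000 subjects) is invented, and the acne outcome is generated independently of everything else; slicing the data every possible way still turns up “discoveries”.

```python
from itertools import product
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(1)
n = 20000
color = rng.integers(0, 20, n)   # 20 jelly bean colors
hand = rng.integers(0, 2, n)     # left or right hand
day = rng.integers(0, 7, n)      # day of the week
acne = rng.random(n) < 0.15      # 15% base rate, independent of everything above

discoveries, tests = 0, 0
for c, h, d in product(range(20), range(2), range(7)):
    in_slice = (color == c) & (hand == h) & (day == d)
    table = [[int(acne[in_slice].sum()), int((~acne[in_slice]).sum())],
             [int(acne[~in_slice].sum()), int((~acne[~in_slice]).sum())]]
    chi2, p, dof, expected = chi2_contingency(table)
    tests += 1
    if p < 0.05:
        discoveries += 1   # e.g. "green + left hand + Tuesday causes acne!"

print(f"{discoveries} 'discoveries' out of {tests} slices of pure noise")
```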

If you find this claim so ridiculous that you would never fall for it, consider all the articles you have seen about the best days to run email marketing, the best days to tweet, or how to do marketing like certain brands.

Can you spot the fact that these are based on Data Dredging?

(See here for a great article on Data Dredging.)

In this age of instant publication, easy experimentation, Big Data and social media echo chamber, how can you spot and stay clear of Random Flukes reported as scientific research?

You can start with this framework:

  1. What are the initial hypotheses before collecting data? If there is none, thanks but no thanks. (Data Dredging)
  2. How were these hypotheses arrived at? If they were derived from the very data they are tested against, keep moving. A great example of this is the class of books, “7 Habits of …”
  3. See a study based on an extremely large sample? With samples that big, even trivially small differences become statistically significant, so significance stops being evidence of a meaningful effect; it is just a mathematical artifact of the sample size (see the sketch after this list). Again, thanks but no thanks.
  4. Very narrow finding? It is the green jelly bean again; ask about the other dimensions that were tested.
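
To illustrate point 3, here is a sketch with invented numbers: the same observed conversion rates, 10.0% versus 10.1%, tested at three different sample sizes with an ordinary two-proportion z-test. The difference is equally trivial in every case, but at a large enough sample it becomes “highly significant”.

```python
from math import sqrt
from scipy.stats import norm

def two_proportion_p(p1, p2, n1, n2):
    """Two-sided z-test for the difference between two observed proportions."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return 2 * norm.sf(abs(z))

# Same observed rates (10.1% vs 10.0%) at three sample sizes per arm
for n in (2_000, 200_000, 2_000_000):
    p = two_proportion_p(0.101, 0.100, n, n)
    print(f"n = {n:>9,} per arm   p-value = {p:.4f}")
```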

Or you can just plain ignore all these nonsensical findings camouflaged in analytics.