Testing 40 shades of blue – AB Testing

The title refers to the famous anecdote about Marissa Mayer testing 40 shades of blue to determine the right color for the links. (Unfortunately, I am colorblind; I know just one blue.)

Mayer is famous for many things at Google, but the one that always sticks out – and defines her in some ways – is the “Forty Shades of Blue” episode.

She ordered that 40 different shades of blue be randomly shown to visitors, each shade to 2.5% of them; Google would note which colour earned more clicks. And that was how the blue colour you see in Google Mail and on the Google page was chosen.

Thousands of such tests happen in the web world, with every major website running multiple experiments a day. Contrary to what many in webapp development may believe, AB testing does not have its origins in the webapp world. It is simply an application of a classic statistical method, the Randomized Controlled Trial, to determine whether a ‘treatment’ made a difference in the performance of the treatment group compared to that of the control group.
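The crucial ingredient, as in any Randomized Controlled Trial, is random assignment. Below is a minimal sketch of how a site might bucket visitors into the two groups; the hash-based scheme and names are illustrative, not any particular product’s implementation.

```python
import hashlib

def assign(visitor_id: str, variants=("control", "treatment")) -> str:
    """Deterministically bucket a visitor so they always see the same variant."""
    h = int(hashlib.md5(visitor_id.encode()).hexdigest(), 16)
    return variants[h % len(variants)]

print(assign("visitor-42"))  # same visitor id -> same bucket on every visit
```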

The simplest test checks whether the observed difference between the two sample means is statistically significant. That means computing the probability, the p-value, of seeing a difference at least this large if the treatment had no effect and the gap were just random. If the p-value is less than a preset significance level, we declare that the treatment made a difference.
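To make the mechanics concrete, here is a minimal sketch of a two-proportion z-test on click-through rates, using only the Python standard library. The visitor and click counts are invented for illustration; they are not anyone’s real numbers.

```python
# Minimal two-proportion z-test for an AB test (normal approximation).
from statistics import NormalDist

def two_proportion_z_test(clicks_a, n_a, clicks_b, n_b):
    """Return (z, two-sided p-value) for H0: both click rates are equal."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    p_pool = (clicks_a + clicks_b) / (n_a + n_b)        # pooled rate under H0
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))        # two-sided
    return z, p_value

# Hypothetical numbers: variant A got 620 clicks from 10,000 visitors,
# variant B got 570 clicks from 10,000 visitors.
z, p = two_proportion_z_test(620, 10_000, 570, 10_000)
print(f"z = {z:.2f}, p-value = {p:.3f}")  # z ≈ 1.49, p ≈ 0.135
```

With a preset level of 0.05, a p-value of 0.135 means we cannot rule out that the gap between A and B is just noise.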

Does it matter if the results are statistically significant? See here why it does not:

“I have published about 800 papers in peer-reviewed journals and every single one of them stands and falls with the p-value. And now here I find a p-value of 0.0001, and this is, to my way of thinking, a completely nonsensical relation.”

Should you test 40 shades of blue to find the one that produces the most click-throughs or conversions? xkcd has the answer:
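The comic’s point is the multiple comparisons problem: run enough tests at the 5% level and something will look significant by pure chance. The arithmetic is a one-liner; the 0.05 level below is the conventional choice, not anything Google or Yahoo actually used.

```python
# Chance of at least one false positive across 40 independent tests,
# each run at the conventional 0.05 significance level.
alpha, tests = 0.05, 40
familywise = 1 - (1 - alpha) ** tests
print(f"P(at least one false positive) = {familywise:.2f}")        # ~0.87

# A Bonferroni correction keeps the family-wise rate near 5%,
# at the price of demanding much stronger evidence per variant.
print(f"Per-test threshold after Bonferroni: {alpha / tests:.4f}")  # 0.0013
```

In other words, test 40 shades at the usual threshold and you are almost guaranteed to ‘find’ a winner even if all 40 perform identically.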

Can Ms. Mayer test her way out of Yahoo’s current condition? Remember, all this split testing is about finding low-hanging fruit, not quantum leaps. And as Jim Manzi wrote in his book Uncontrolled,

Perhaps the single most important lesson I learned in commercial experimentation, and that I have since seen reinforced in one social science discipline after another, is that there is no magic. I mean this in a couple of senses. First, we are unlikely to discover some social intervention that is the moral equivalent of polio vaccine. There are probably very few such silver bullets out there to be found. And second, experimental science in these fields creates only marginal improvements. A failing company with a poor strategy cannot blindly experiment its way to success …

You can’t make up for poor strategy with incessant experimentation.