A Closer Look at A/B Testing

Suppose you have a service – be it a web-based service or a brick-and-mortar one. Visitors walk through the front door. Most just leave without taking an action that is favorable to you. Some do and become Converts.

As you go along, you form a belief/gut-feel/hypothesis that the color of the door affects how many will convert. Specifically, a certain color will improve conversion. (I am color blind, else I would call out the colors I used below.)

You verify this hypothesis by running a split test. You evenly split your visitor stream, randomly sending each visitor through Door A with the current color or Door B with the new color that is supposed to increase conversion. This is the A/B split test.
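If it helps to make that split concrete, here is a minimal sketch of the random 50/50 assignment in Python. The visitor IDs and the assign_door helper are made up for illustration; real tools usually hash the visitor ID so the same visitor always sees the same door.

```python
# A minimal sketch of the even, random split of the visitor stream.
# The visitor IDs and the assign_door helper are illustrative only.
import random

def assign_door(visitor_id, doors=("A", "B")):
    """Route a visitor through Door A or Door B with equal probability.

    (A production tool would typically hash visitor_id instead, so the
    assignment is sticky across visits.)
    """
    return random.choice(doors)

# Example: route ten visitors and look at the split.
assignments = {visitor_id: assign_door(visitor_id) for visitor_id in range(10)}
print(assignments)
```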

How do you verify your hypothesis?

The most common way, practiced by every A/B test tool in the market, is described below.

These tools keep both Converts and Non-Converts for a given Door together and treat each Door's visitors as a separate population. Those who went through Door A (both Converts and Non-Converts) are kept separate from those who went through Door B. The tools test the hypothesis that the proportion of Converts in the Door B population is higher than the proportion of Converts in the Door A population. They assume that the population data are normally distributed and use a 2-sample t-test to verify that the difference between the two proportions is statistically significant.
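In code, that approach looks roughly like the sketch below: a two-proportion comparison under the normality assumption described above (with samples this large, the t-test and the z-test on proportions are effectively the same). The visitor and Convert counts are made up for illustration.

```python
# A minimal sketch of the tools' approach: keep Converts and Non-Converts
# together for each Door and compare the two conversion proportions under
# a normal approximation. All counts below are made up.
from math import sqrt
from scipy.stats import norm

visitors_a, converts_a = 10_000, 300   # Door A: current color
visitors_b, converts_b = 10_000, 345   # Door B: new color

p_a = converts_a / visitors_a
p_b = converts_b / visitors_b

# Pooled proportion and standard error, assuming the difference in
# proportions is approximately normally distributed.
p_pool = (converts_a + converts_b) / (visitors_a + visitors_b)
se = sqrt(p_pool * (1 - p_pool) * (1 / visitors_a + 1 / visitors_b))

z = (p_b - p_a) / se
p_value = norm.sf(z)  # one-sided: is Door B's proportion higher?

print(f"Door A: {p_a:.3%}  Door B: {p_b:.3%}  z = {z:.2f}  p = {p_value:.4f}")
```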

What is wrong with this approach? For starters, you can see how it rewrites the hypothesis and re-wires the model. This approach treats conversion as an attribute of the visitor. It is using the t-test for the wrong purpose, or put another way, using the wrong statistical test for A/B testing.

For example, if you want to test whether there is a higher prevalence of heart disease among Indians living in the US vs. those living in India, you would draw random samples from the two populations, measure the proportion of heart disease in each sample, and do a t-test to see if the difference is statistically significant. That is a valid use of the t-test for population proportions.

Conversion isn’t the same as measuring the proportion of a population characteristic like heart disease. Treating the conversion rate as a characteristic of the visitor is contrived. It also forces you to keep the Converts and Non-Converts together, when you really only need to look at those who converted.

Is there another way?

Yes. Take a look at this model, which closely aligns with the normal flow. We really do not care about the Non-Converts, and we test the correct hypothesis that more Converts came through Door B than through Door A.

This method grabs a random sample of Converts and tests whether more of them came through Door B than through Door A. It uses a Chi-square test to verify that the difference is not just due to randomness. No other assumptions are needed, like assuming a normal distribution, and it tests the right hypothesis. Most importantly, it fits the flow and the model we had before we introduced Door B.
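Here is a minimal sketch of that Converts-only test: count how many Converts came through each Door and run a Chi-square goodness-of-fit check against the even 50/50 split of the visitor stream. The counts are made up for illustration.

```python
# A minimal sketch of the Converts-only approach: look only at those who
# converted, and test their split across the two Doors against the even
# 50/50 split of the visitor stream. Counts below are made up.
from scipy.stats import chisquare

converts_a, converts_b = 300, 345            # Converts observed per Door
observed = [converts_a, converts_b]
total = sum(observed)
expected = [total / 2, total / 2]            # even split expected under the null

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi-square = {stat:.2f}  p = {p_value:.4f}")
```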

Want to know more? Want to know the implications of this and how you can influence your A/B test tool vendors to change?  Drop me a note.

Whale of a Sample Size in Statistical Testing

Australia is taking Japan to court to stop Japan from killing whales in the name of scientific testing.  The whales that are captured and killed for “research” are later sold as food. In a year, Japan harpoons and kills about 1000 whales for their research work.

What has this got to do with statistical significance?

We have to go all the way back to 2005, when Japan implemented what it called JARPA-2 against the wishes of the International Whaling Commission. Under JARPA-2, Japan increased its whale intake from its then sampling rate of 440 whales to 1000 whales.

“We will implement JARPA-2 according to the schedule, because the sample size is determined in order to get statistically significant results.”

When everything else is held constant, increasing the sample size from 440 to 1000 will increase statistical significance because of the way the standard error (SE) is computed. The SE, which measures sampling precision, goes from σ/√440 to σ/√1000, a lower number that almost guarantees statistical significance. (see reference)
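A quick numeric check of that point, with σ fixed at an arbitrary 1.0:

```python
# Holding sigma constant, moving from n = 440 to n = 1000 shrinks the
# standard error, so the same observed difference yields a larger test
# statistic. sigma = 1.0 is an arbitrary placeholder.
from math import sqrt

sigma = 1.0
for n in (440, 1000):
    print(f"n = {n:4d}  SE = sigma/sqrt(n) = {sigma / sqrt(n):.4f}")
```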

Under the cloak of statistical significance, more whales are being sampled without regard to the economic and ecological significance.

Consider this in the context of your A/B testing. Yes, even minor differences will appear statistically significant by the magic of large samples. But statistical significance alone is not sufficient; we need to ask whether these differences have economic significance. Should we chase these tiny differences and lose the opportunity to win over the rest of the 97% who are not converting?
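To see this in the A/B context, here is a sketch with made-up numbers: a 3.00% vs 3.05% conversion difference is economically trivial, yet with enough visitors per door it crosses the usual 0.05 significance threshold.

```python
# A sketch of how a tiny conversion difference becomes "significant" once
# the samples are large enough. The conversion rates and visitor counts
# are illustrative, not from any real test.
from math import sqrt
from scipy.stats import norm

def p_value(p_a, p_b, n):
    """One-sided p-value for the difference of two proportions, n visitors per door."""
    p_pool = (p_a + p_b) / 2
    se = sqrt(p_pool * (1 - p_pool) * (2 / n))
    return norm.sf((p_b - p_a) / se)

for n in (10_000, 100_000, 1_000_000):
    print(f"n = {n:>9,} per door  p = {p_value(0.0300, 0.0305, n):.4f}")
```

The difference stays the same throughout; only the sample size changes, and at a million visitors per door the test calls it significant.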