Suppose you have a service – be it a web-based service or a brick-and-mortar one. Visitors walk through the front door. Most just leave without taking an action that is favorable to you. Some do and become Converts.
As you go along, you form a belief/gut-feel/hypothesis that the color of the door affects how many will convert. Specifically, a certain color will improve conversion. (I am color blind, else I would call out the colors I used below.)
You verify this hypothesis by running a split test. You evenly split your visitor stream, randomly sending visitors through Door A, painted the current color, or Door B, painted the new color that is supposed to increase conversion. This is the A/B split test.
How do you verify your hypothesis?
The most common approach, practiced by virtually every A/B test tool on the market, is shown below.
These tools keep both Converts and Non-Converts for a given Door together and treat each group as a separate population. Those who went through Door A (both Converts and Non-Converts) are kept separate from those who went through Door B. The tools test the hypothesis that the proportion of Converts in the Door B population is higher than the proportion of Converts in the Door A population. They assume the population data are normally distributed and use a 2-sample t-test to verify that the difference between the two proportions is statistically significant.
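To make that conventional setup concrete, here is a minimal sketch of it as a pooled two-proportion test (written in its large-sample z form, which is what the t-test on proportions reduces to for big visitor counts). The visitor and convert counts are made up purely for illustration.

```python
# Hypothetical counts of visitors and Converts through each door (illustrative only).
from math import sqrt, erf

def norm_sf(z):
    """Survival function (1 - CDF) of the standard normal distribution."""
    return 0.5 * (1.0 - erf(z / sqrt(2.0)))

visitors_a, converts_a = 10_000, 300   # Door A: current color
visitors_b, converts_b = 10_000, 360   # Door B: new color

p_a = converts_a / visitors_a
p_b = converts_b / visitors_b

# Pooled conversion rate under the null hypothesis that both doors convert equally
p_pool = (converts_a + converts_b) / (visitors_a + visitors_b)
se = sqrt(p_pool * (1 - p_pool) * (1 / visitors_a + 1 / visitors_b))

# One-sided test: is Door B's conversion proportion higher than Door A's?
z = (p_b - p_a) / se
p_value = norm_sf(z)

print(f"p_A={p_a:.3f}  p_B={p_b:.3f}  z={z:.2f}  p-value={p_value:.4f}")
```

Note how the Non-Converts enter the calculation through the visitor totals: the test cannot be run without them.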
What is wrong with this approach? For starters, you can see how it rewrites the hypothesis and re-wires the model: it treats conversion as an attribute of the visitor. This is using the t-test for the wrong purpose – or, put another way, using the wrong statistical test for A/B testing.
For example, if you want to test whether there is a higher prevalence of heart disease among Indians living in the US vs. those living in India, you draw random samples from the two populations, measure the proportion with heart disease in each sample, and run a t-test to see if the difference is statistically significant. That is a valid use of the t-test for population proportions.
Conversion is not the same as measuring the proportion of a population with a characteristic like heart disease. Treating the conversion rate as a characteristic of the visitor is contrived. It also forces you to keep the Converts and Non-Converts together, when you only need to look at those who converted.
Is there another way?
Yes. Take a look at this model, which closely aligns with the normal flow. We really do not care about the Non-Converts, and we test the correct hypothesis: that more Converts came through Door B than through Door A.
This method takes a random sample of Converts and tests whether more of them came through Door B than through Door A. It uses a Chi-square test to verify that the difference is not just due to randomness. No other assumptions are needed, such as assuming a normal distribution, and it tests the right hypothesis. Most importantly, it fits the flow and the model that existed before we introduced Door B.
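Here is a minimal sketch of that converts-only approach. Since the visitor stream was split evenly between the two doors, the null hypothesis is that a Convert is equally likely to have come through either door, i.e. the expected split of Converts is 50/50. The counts are made up for illustration.

```python
# Hypothetical counts of Converts by door (illustrative only).
from scipy.stats import chisquare

converts_a = 300   # Converts who came through Door A
converts_b = 360   # Converts who came through Door B

# Null hypothesis: with an even split of visitors, Converts are expected
# to be split 50/50 between the doors (chisquare defaults to uniform expected counts).
result = chisquare([converts_a, converts_b])

print(f"chi-square={result.statistic:.2f}  p-value={result.pvalue:.4f}")
```

Notice that only the Converts appear in the calculation; the Non-Converts never enter the picture, which is exactly the point.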
Want to know more? Want to know the implications of this and how you can influence your A/B test tool vendors to change? Drop me a note.