Suppose you have a service – be it a web-based service or a brick-and-mortar one. Visitors walk through the front door. Most just leave without taking an action that is favorable to you. Some do take such an action and become converts.

As you go along, you form a belief/gut-feel/hypothesis that the color of the door affects how many will convert. Specifically, a certain color will improve conversion. (I am color blind, else I would call out the colors I used below.)

You verify this hypothesis by running a split test. You evenly split your visitor stream, randomly sending them through Door A of current color or Door B of the new color which is supposed to increase conversion. This is the A/B split test.

How do you verify your hypothesis?

The most common way, practiced by every A/B test tool in the market, works as follows.

These tools keep both Converts and Non-Converts for a given Door together and treat each Door's visitors as a separate population. Those who went through Door A (both Converts and Non-Converts) are kept separate from those who went through Door B. They test the hypothesis that the proportion of converts in the Door B population is higher than the proportion of converts in the Door A population. The tools assume that the population data are normally distributed and use a 2-sample t-test to verify that the difference between the two proportions is statistically significant.
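To make the arithmetic of this approach concrete, here is a minimal sketch of a one-sided test on two proportions. It is not taken from any particular tool: the function name and the visitor counts are hypothetical, and it swaps in the pooled normal-approximation (z) statistic, which for large samples behaves much like the t-test described above.

```python
import math

def two_proportion_test(conv_a, n_a, conv_b, n_b):
    """One-sided test: is Door B's conversion rate higher than Door A's?

    Hypothetical helper for illustration; uses the pooled z statistic.
    """
    p_a = conv_a / n_a  # conversion rate of the Door A population
    p_b = conv_b / n_b  # conversion rate of the Door B population
    # Pooled rate under the null hypothesis that the doors do not differ
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Upper-tail probability of the standard normal distribution
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value

# Made-up counts: 1000 visitors per door, 100 vs. 130 converts
z, p_value = two_proportion_test(conv_a=100, n_a=1000, conv_b=130, n_b=1000)
print(f"z = {z:.3f}, one-sided p-value = {p_value:.4f}")
```

Note that the Non-Converts enter the calculation through `n_a` and `n_b`: each door's visitors are treated as their own population.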

What is wrong with this approach? For starters, you can see how it rewrites the hypothesis and re-wires the model. This approach treats conversion as an attribute of the visitor. This is using the t-test for the wrong purpose or using the wrong statistical test for A/B testing.

For example, if you want to test whether there is a higher prevalence of heart disease among Indians living in the US vs. India, you would draw random samples from the two populations, measure the proportion of heart disease in each sample, and do a t-test to see if the difference is statistically significant. That is a valid use of the t-test for population proportions.

Measuring conversion isn’t the same as measuring the proportion of a population characteristic like heart disease. Treating the conversion rate as a characteristic of the visitor is contrived. It also forces you to keep the Converts and Non-Converts together, when you only need to look at those who converted.

Is there another way?

Yes. Take a look at this model that closely aligns with the normal flow. We really do not care about the Non-Converts and we test the correct hypothesis that more Converts came through Door B than through Door A.

This method grabs a random sample of Converts and tests whether more of them came through Door B than through Door A. It uses a chi-square test to verify that the difference is not just due to randomness. It needs no other assumptions, such as a normal distribution, and it tests the right hypothesis. Most importantly, it fits the flow and model we had before we introduced Door B.
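A minimal sketch of this test, using only the Converts, might look like the following. The function name, the counts and the `share_a` parameter are my assumptions for illustration. With an even split, the null hypothesis is that half of the Converts came through each door. `share_a` lets you state a different expected share if the split was not even.

```python
import math

def chi_square_converts(conv_a, conv_b, share_a=0.5):
    """Chi-square goodness-of-fit test on Converts only (1 degree of freedom).

    Hypothetical helper for illustration. share_a is the share of visitors
    sent through Door A (0.5 for the even split described above).
    """
    total = conv_a + conv_b
    expected_a = total * share_a
    expected_b = total * (1 - share_a)
    chi2 = ((conv_a - expected_a) ** 2 / expected_a
            + (conv_b - expected_b) ** 2 / expected_b)
    # With 1 degree of freedom, the chi-square tail probability
    # equals erfc(sqrt(chi2 / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Made-up sample of 230 Converts: 100 came through Door A, 130 through Door B
chi2, p_chi = chi_square_converts(conv_a=100, conv_b=130)
print(f"chi-square = {chi2:.3f}, p-value = {p_chi:.4f}")
```

The test is two-sided: it says whether the split of Converts departs from the expected shares, and the observed counts tell you in which direction.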

Want to know more? Want to know the implications of this and how you can influence your A/B test tool vendors to change? Drop me a note.

How do you account for door A and B not being shown the same number of times? e.g. if you look at the number of conversions and you see that A delivered more conversions than B the result is clearly different if B was shown 10% of the time or 50% of the time.


Hi Ryan,

I think you misunderstood the post. Users don’t have a choice; they are just sent through door A or door B.


Ryan

I am with you on the statistical significance.

The post was not meant as a recommendation for A/B testing, but the opposite.

-Rags


I’m not sure that the door A and door B example you use accurately describes the A/B testing that is normally done in internet marketing. Users are not presented with door A and door B and then asked to choose one. They are presented with door A or door B, then a behavior either happens or it does not. I would argue that this results in two different populations and that “conversion rate” is an attribute of the population. Additionally, I think marketers get far too concerned with statistical significance in A/B tests and fail to understand the reasons for the behaviors that are taking place.
