You may not know it, but your A/B testing tool may be finding differences that really aren't there. The tool may tell you that version 2 performs better than version 1 because it found a statistically significant difference between the performance of the two versions, when in reality there may be no difference. As a result, you may end up investing time and resources in more tests, fine-tuning minor differences.
The problem comes from three fronts:
- Definition of performance: Using percentage conversion rates, a choice forced by the next point.
- Statistical test used in A/B testing: Using Student t-test.
- Using extremely large samples: Samples larger than 300, a choice forced by the use of t-test on conversion rates.
A/B testing is about finding whether there is a statistically significant difference, at a preset confidence level (usually 95%), between the performances of the two versions under test. The statistical test used by some of the tools is the Student t-test, and the performance metric compared is the percentage conversion rate.
Let p1 and p2 be the conversion rates of the two versions. If the difference p2-p1 (or vice versa) is found to be statistically significant, we are told version 2 WILL perform better. Worse, some may even conclude version 2 WILL perform 47.5% (or some such number) better, based on the math (p2-p1)/p1%.
For the sake of running valid tests, these tools run the tests over long periods of time and collect large amounts of data. Then they run the t-test on the entire data set, typically thousands of data points.
In a paper titled Rethinking Data Analysis published in the International Journal of Marketing Research, Vol 52, Issue 1, Prof. Ray Kent writes,
For large samples – certainly samples over 300 or so–any association large enough to attract the attention of the researcher will always be statistically significant
With large samples we are violating the Random Sampling requirement for statistical testing. When everything else is held constant, large samples (most of the tests I see use sample sizes upwards of 5000) increase statistical significance. Differences that are too small to show up in small samples are magnified in large samples. Large samples have one big problem: they lose all information about segmentation. While you may find no difference between the versions within each segment, put together you will find a statistically significant difference with large samples.
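To make the sample size effect concrete, here is a minimal Python sketch. The conversion rates (1.0% and 1.2%) and the sample sizes are hypothetical numbers picked only for illustration, and `t_stat` is a helper name introduced here, not something from any A/B testing tool. The observed difference is held fixed while only the sample size per version changes.

```python
import math

def t_stat(p1, p2, n):
    """t-statistic for two conversion rates, each measured on n visitors."""
    se1 = math.sqrt(p1 * (1 - p1) / n)
    se2 = math.sqrt(p2 * (1 - p2) / n)
    return (p2 - p1) / math.sqrt(se1**2 + se2**2)

# Hypothetical rates, held fixed while the sample size grows.
P1, P2 = 0.010, 0.012
CRITICAL = 1.645  # one-tailed critical value at the 5% level, large df

results = {n: t_stat(P1, P2, n) for n in (300, 1000, 5000, 20000)}
for n, t in results.items():
    print(f"n={n:>6}  t={t:.2f}  significant={t > CRITICAL}")
```

With these assumed numbers, the same 0.2 point gap is nowhere near significant at n = 300 and crosses the 1.645 threshold by n = 20000, because the t-stat grows with the square root of n.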
Imagine this: suppose you collected 5200 samples for version 1 and 5300 for version 2. Let us say the samples include equal numbers of males and females. While you may find no statistically significant difference for males and females separately, you might find one for the total. What if you don't know the segmentation dimensions in advance? What about the hidden demographic and psychographic segmentation dimensions that are not teased out? (See below for detailed math.)
The net is, convinced by the magical words, “statistically significant difference”, you end up magnifying differences that are not real differences at all and continue to invest in more and more tests picking Red arrows over Green arrows.
How can you fix it? Stay tuned for the next in this series.
Here is the A/B test math as practiced by popular tools:
Let us use data published in a 2006 article by Avinash Kaushik (used only for illustrative purposes here and, I believe, in that post as well).
Let us start with the hypotheses:
H0: p1=p2, any difference in conversion rate is due to chance
H1: p2 > p1, the alternate hypothesis (we will be using a one-tailed test)
Then you do the experiment. You send out two offers to potential customers. Here is how the outcomes look:
First we compute the standard error SE for each offer, which is approximated as sqrt( p(1-p)/n ). Note the “n” in the denominator: the higher the “n”, the lower the SE.
In this example, SE1 = 0.001276 and SE2 = 0.001516. Then we compute the common SE between samples, which is the square root of the sum of the squares of SE1 and SE2. Here SE = 0.00198.
Then we compute the t-stat, (p2-p1)/SE = 1.72. From the t-table, for degrees of freedom = ∞ (the table treats anything above 120 as infinity), we find the one-tailed critical value at the 0.05 significance level is 1.645. Since 1.72 > 1.645, we declare statistical significance.
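The arithmetic above can be sketched in a few lines of Python. The conversion rates and sample sizes in the code (0.87% of 5300 and 1.21% of 5200) are assumptions: they are not stated in the text here, and were chosen because they reproduce the SE1, SE2 and t-stat values quoted above. The helper name `standard_error` is also introduced only for this illustration.

```python
import math

def standard_error(p, n):
    """Approximate standard error of a conversion rate p measured on n samples."""
    return math.sqrt(p * (1 - p) / n)

# ASSUMED inputs: picked so that the results match the SE values quoted in the text.
p1, n1 = 0.0087, 5300   # offer 1
p2, n2 = 0.0121, 5200   # offer 2

se1 = standard_error(p1, n1)
se2 = standard_error(p2, n2)

# Common SE: square root of the sum of the squares of SE1 and SE2.
se = math.sqrt(se1**2 + se2**2)

# t-stat, compared against the one-tailed critical value at the 0.05 level.
t = (p2 - p1) / se
CRITICAL = 1.645

print(f"SE1={se1:.6f}  SE2={se2:.6f}  SE={se:.5f}")
print(f"t={t:.2f}  significant={t > CRITICAL}")
```

Under these assumptions the sketch gives a t-stat of about 1.72, just past the 1.645 cutoff.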
Now let us say that the offers were sent to two Geos, US and EMEA. Let us assume exactly half of each offer was sent to each Geo. Let us also assume that we received exactly the same number of responses from each Geo.
Your p1 and p2 remain the same but your SE1 increases from 0.001276 to 0.001804 and SE2 increases from 0.001516 to 0.002144. So SE increases to 0.0028.
When you do the t-test for US and EMEA separately, the t-stat you compute will be 1.216, less than the 1.645 from the t-table. In other words, there is no statistically significant difference between the two offers for US, and the same is true for EMEA. But when we put the two together, we found otherwise.
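Here is a rough Python sketch of the Geo split. As before, the rates and sample sizes (0.87% of 5300 and 1.21% of 5200) are assumed values that reproduce the SE figures in this post, not numbers stated in it, and `t_stat` is a hypothetical helper. Each Geo keeps the same conversion rates but gets half the samples.

```python
import math

def t_stat(p1, n1, p2, n2):
    """t-statistic for the difference between two conversion rates."""
    se1 = math.sqrt(p1 * (1 - p1) / n1)
    se2 = math.sqrt(p2 * (1 - p2) / n2)
    return (p2 - p1) / math.sqrt(se1**2 + se2**2)

# ASSUMED inputs, chosen to match the SE values quoted in the text.
p1, n1 = 0.0087, 5300
p2, n2 = 0.0121, 5200
CRITICAL = 1.645  # one-tailed, 0.05 level

# All responses pooled together.
t_all = t_stat(p1, n1, p2, n2)

# One Geo (US or EMEA): same rates, half the samples of each offer.
t_geo = t_stat(p1, n1 / 2, p2, n2 / 2)

print(f"pooled:  t={t_all:.2f}  significant={t_all > CRITICAL}")
print(f"per Geo: t={t_geo:.2f}  significant={t_geo > CRITICAL}")
```

Halving the sample size divides the t-stat by the square root of 2, so the pooled value of about 1.72 drops to about 1.21 per Geo (the 1.216 above comes from dividing the rounded 1.72), which falls below the 1.645 cutoff.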
You could counter this by saying we should collect 5200 samples for each Geo. But what if the segmentation dimensions are not known in advance? What about other demographic and psychographic segmentation?
Large samples will find statistically significant differences that are in reality not significant at all.
Another mistake is to quote the % difference between versions. It is just wrong to say one version performed better than the other by x%. Note that the alternate hypothesis is p2 > p1, and it DOES NOT say anything about by how much. So when we find a statistically significant difference, we reject H0, but there is nothing in our hypothesis or the method to say version 2 performed better than version 1 by x%!