Is Your A/B Testing Tool Finding Differences That Aren’t?

You may not know it, but your A/B testing tool may be finding differences that aren't really there. The tool may tell you that version 2 performs better than version 1 because it found a statistically significant difference between the performance of the two versions, when in reality there may be no difference at all. As a result you may end up investing time and resources in more tests, fine-tuning minor differences.

The problem arises on three fronts:

  1. Definition of performance: using percentage conversion rates, a choice forced by the next point.
  2. Statistical test used in A/B testing: using the Student t-test.
  3. Using extremely large samples: samples far larger than 300, a choice forced by applying the t-test to conversion rates.

A/B testing is about finding whether there is a statistically significant difference, at a preset confidence level (usually 95%), between the performances of the two versions under test. The statistical test used by some of the tools is the Student t-test, and the performance metric compared is the percentage conversion rate.

Let p1 and p2 be the conversion rates of the two versions. If the difference p2-p1 (or vice versa) is found to be statistically significant, we are told version 2 WILL perform better. Worse, some may even conclude version 2 WILL perform 47.5% (or some such number) better, based on the math 100*(p2-p1)/p1.

For the sake of running valid tests, these tools run the tests over long periods of time and collect large amounts of data. Then they run the t-test on the entire data set, typically thousands of data points.

In a paper titled Rethinking Data Analysis, published in the International Journal of Market Research, Vol. 52, Issue 1, Prof. Ray Kent writes,

For large samples – certainly samples over 300 or so – any association large enough to attract the attention of the researcher will always be statistically significant

With large samples we are violating the random-sampling requirement for statistical testing. With everything else held constant, large samples (most of the tests I see use upwards of 5,000 data points) increase statistical significance: the standard error shrinks as the sample grows, so differences too small to show up in small samples get magnified in large ones. Large samples have one more big problem: they lose all information about segmentation. While you may find no difference between the versions within each segment, put together you will find a statistically significant difference with large samples.
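To see just the sample-size effect, here is a minimal sketch in Python (the rates and sample sizes are chosen purely for illustration) that holds the two conversion rates fixed and recomputes the test statistic, the difference in rates divided by the combined standard error, as the sample grows:

```python
import math

def t_stat(p1, n1, p2, n2):
    # Difference in conversion rates divided by the combined standard error,
    # the same quantity the A/B test math below computes.
    se1 = math.sqrt(p1 * (1 - p1) / n1)
    se2 = math.sqrt(p2 * (1 - p2) / n2)
    return (p2 - p1) / math.sqrt(se1 ** 2 + se2 ** 2)

# Same 0.87% vs 1.21% conversion rates, increasingly large samples per version.
for n in (300, 1000, 5000, 20000):
    print(n, round(t_stat(0.0087, n, 0.0121, n), 2))
# The identical difference falls short of the 1.645 cutoff at small n
# and crosses it once n reaches several thousand.
```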

Imagine you collected 5,200 samples for version 1 and 5,300 for version 2, and that the samples include equal numbers of males and females. While you may find no statistically significant difference for males and females separately, you might find one for the total. What if you don't know the hidden segmentation dimensions? What about the demographic and psychographic segmentation dimensions that are never teased out? (See below for the detailed math.)

The net is, convinced by the magical words “statistically significant difference”, you end up magnifying differences that are not real differences at all and continue to invest in more and more tests, picking red arrows over green arrows.

How can you fix it? Stay tuned for the next in this series.


Here is the A/B test math as practiced by popular tools:

Let us use data published in a 2006 article by Avinash Kaushik (used only for illustrative purposes here and, I believe, in that post as well).

Let us start with the hypotheses:

H0: p1 = p2, any difference in conversion rate is due to chance
H1: p2 > p1, the alternate hypothesis (we will be using a one-tailed test)

Then you do the experiment: you send out the two offers to potential customers. Here is how the outcomes look:

  • Offer One Responses: n1 = 5,300. Order: 46. Hence Conversion Rate p1 = 0.87%
  • Offer Two Responses: n2 = 5,200. Order: 63. Hence Conversion Rate p2 = 1.21%
  • First we compute the standard error SE for each offer, which is approximated as sqrt( p(1-p)/n ). Note the “n” in the denominator: the higher the “n”, the lower the SE.

In this example, SE1 = 0.001276 and SE2 = 0.001516. Then we compute the common SE between the samples, which is the square root of the sum of the squares of SE1 and SE2. Here SE = 0.00198.

Then we compute the t-stat, (p2-p1)/SE = 1.72. From the t-table, for degrees of freedom = ∞ (anything more than 120 counts as infinity in this table), we find the one-tailed critical value for p-value 0.05 is 1.645. Since 1.72 > 1.645, we declare statistical significance.
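Here is the same computation as a minimal Python sketch, using the illustrative numbers above (the small difference from the quoted 1.72 comes from not rounding the conversion rates):

```python
import math

# Aggregate test on the illustrative numbers above: 46/5,300 vs 63/5,200.
n1, orders1 = 5300, 46
n2, orders2 = 5200, 63

p1, p2 = orders1 / n1, orders2 / n2    # ~0.87% and ~1.21%
se1 = math.sqrt(p1 * (1 - p1) / n1)    # ~0.00127
se2 = math.sqrt(p2 * (1 - p2) / n2)    # ~0.00152
se = math.sqrt(se1 ** 2 + se2 ** 2)    # ~0.00198

t = (p2 - p1) / se                     # ~1.73, above the 1.645 cutoff
print(round(t, 2))
```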

Now let us say the offers you sent went to two Geos, US and EMEA. Assume exactly half of each offer was sent to each Geo, and that we received exactly equal numbers of responses from each Geo.

Your p1 and p2 remain the same, but SE1 increases from 0.001276 to 0.001804 and SE2 increases from 0.001516 to 0.002144. So SE increases to 0.0028.

When you do the t-test for US and EMEA separately, the t-stat you compute is 1.216, less than the 1.645 from the t-table. In other words, there is no statistically significant difference between the two offers for the US, and the same is true for EMEA. Yet when we put the two Geos together, we found otherwise.
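The split is easy to check with the same kind of sketch as before (again using the illustrative numbers; the t-stats differ slightly from those quoted above because the conversion rates are not rounded here):

```python
import math

def t_stat(p1, n1, p2, n2):
    # Difference in conversion rates over the combined standard error.
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return (p2 - p1) / se

p1, p2 = 46 / 5300, 63 / 5200

# Both Geos pooled: crosses the 1.645 cutoff.
print(round(t_stat(p1, 5300, p2, 5200), 2))   # ~1.73

# Each Geo on its own: same rates, half the sample, below the cutoff.
print(round(t_stat(p1, 2650, p2, 2600), 2))   # ~1.23
```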

You could counter this by saying we should collect 5,200 samples for each Geo. But what if the segmentation dimensions are not known in advance? What about other demographic and psychographic segmentation?

Large samples will find statistically significant differences that are in reality not significant at all.

Another mistake is to quote the % difference between versions. It is just wrong to say one version performed better than the other by x%. Note that the alternate hypothesis is p2 > p1, and it DOES NOT say anything about by how much. So when we find a statistically significant difference, we reject H0, but there is nothing in our hypothesis or the method to say p2 performed better than p1 by x%!
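To make that concrete, here is the naive lift figure people quote, computed from the same illustrative numbers; nothing in the one-tailed test above supports reporting it as the improvement you will get:

```python
p1, p2 = 46 / 5300, 63 / 5200

# The naive "version 2 is x% better" figure: 100 * (p2 - p1) / p1, roughly 40%.
# Rejecting H0 (p1 = p2) in favor of H1 (p2 > p1) only speaks to direction,
# not to the size of the difference.
print(round(100 * (p2 - p1) / p1, 1))
```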

Who Makes the Hypothesis in Hypothesis Testing?

Most of my work on pricing and consumer behavior studies relies on hypothesis testing. Be it finding the difference in means between two groups, running a non-parametric test, or making a causation claim, explicitly or implicitly I apply hypothesis testing. I make overarching claims about customer willingness to pay, and what factors influence it, based on hypothesis testing. The same is true for the most popular topic these days for anyone with a web page – A/B split testing. There is nothing wrong with these methods, and I bet I will continue to use them in all my other work.

We should note, however, that the use of hypotheses and the finding of statistically significant differences should not blind us to the fact that some amount of subjectivity goes into all of this. Another important distinction: despite the name, in hypothesis testing we are not testing whether the hypothesis is validated but whether the data fits the hypothesis, which we take as given. More on this below.

All of these tests proceed as follows:

  1. Start with the hypothesis. In fact you always start with two; the null hypothesis is the same for any statistical test.
     The null hypothesis H0: the observed difference between subjects (or groups) is just due to randomness.
     Then you write down the hypothesis that you want to make a call on.
     The alternative hypothesis H1: the observed difference between subjects (or groups) is indeed due to one or more treatment factors that you control for.
  2. Pick the statistical test you want to use among those available for your case, be it a non-parametric test like chi-square, which makes no assumption about the distribution of the data (A/B testing), or a parametric test like the t-test, which assumes a Gaussian (normal) distribution of the data. (See the chi-square sketch after this list.)
  3. Select a critical value or confidence level for the test: 90%, 95%, or 99%, with 95% being the most common. This is completely subjective. What you are stating with the critical value is that the results are statistically significant only if they could be caused by randomness in less than 5% (100% - 95%) of cases. The critical value is also expressed as a p-value (probability), in this case 0.05.
  4. Perform the test with random sampling. This needs more explanation but is beyond the scope of what I want to cover here.
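Step 2 mentions the chi-square option for A/B testing. Here is a minimal sketch of what that looks like on the illustrative offer data from earlier, assuming scipy is available (note that the chi-square test on a 2x2 table is two-sided, unlike the one-tailed test above):

```python
from scipy.stats import chi2_contingency

# Rows: offers; columns: converted vs. did not convert.
observed = [[46, 5300 - 46],   # Offer One
            [63, 5200 - 63]]   # Offer Two

chi2, p_value, dof, expected = chi2_contingency(observed)
print(round(chi2, 2), round(p_value, 4))  # compare p_value against 0.05
```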

As you can see, we, the analyst/decision maker, make up the hypothesis, and we treat the hypothesis as given. We did the right thing by writing it down first. (A common mistake in many A/B tests and in data-mining exercises is writing the hypothesis after the test.)

What we are testing is: taking the hypothesis as given, what is the probability of seeing test data like D? The machinery actually conditions on the null: it asks how likely data at least as extreme as D would be if H0 were true.

This is expressed as P(D|H0), the p-value. Statistical significance here means P(D|H0) < 0.05. Notice that nowhere do we compute the probability of the hypothesis itself given the data, P(H1|D).

When we say we accept H1, we are really saying H0 (randomness) cannot plausibly be the reason and hence H1 must be true. We ignore the possibility that the observed data could be explained by any number of alternative hypotheses. Since we wrote the original hypothesis, if we did not base it on proper qualitative analysis then we could be wrong despite the fact that our test yields statistically significant results.
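For contrast, and as a preview of the Bayesian treatment, here is the standard Bayes' rule identity relating the two quantities, written treating H0 and H1 as the only two possibilities; nothing in the classical test above computes the left-hand side:

```latex
P(H_1 \mid D) = \frac{P(D \mid H_1)\,P(H_1)}{P(D \mid H_1)\,P(H_1) + P(D \mid H_0)\,P(H_0)}
```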

This is why you should never launch a survey without doing focus groups and customer interviews first. This is why you don't jump into statistical testing before understanding enough about the subjects under study to frame relevant hypotheses. Otherwise you are, as some wrote to me, using gut feel or pulling things out of thin air and accepting them simply because there is not enough evidence in the data to overturn the null hypothesis.

How do you come up with your hypotheses?

Look for my next article on how this is different in Bayesian statistics.

We are increasing prices because …

I saw a notice posted on the external doors of an ice rink that said,

Please close the doors behind you otherwise the rink will fog up

I did not stand around to measure how many people followed the advice, or whether that number was better than it would have been if the sign had simply said “Please close the door behind you”. But other people have done such studies.

In the book Influence: The Psychology of Persuasion, author Robert B. Cialdini narrates the work done by Harvard social psychologist Ellen Langer on the power of the word “because”.

People simply like to have reasons for what they do.

It does not matter how relevant or meaningful the reason is; the word “because” made the difference in people accepting the request. This isn't to say that giving reasons for requests works universally, but it does help to reduce resistance.

Take the case of price increases. When a marketer pushes through price increases without offering any reason, customers resist them and perceive the increase as unfair. But if the price increase is justified with a reason, a greater number of customers will accept it. In their paper titled Perceptions of Price Fairness, researchers Gielissen, Dutilh, and Graafland validated the hypothesis that price increases justified with cost arguments are perceived as fair by customers.

Ellen Langer’s and Cialdini’s work points to another possible reason for customer acceptance of higher prices – it is not the justification itself but the mere presence of one. This opens up opportunities for both B2C and B2B marketers to re-price their offerings or capture greater value without turning away customers – just give a reason.

We see that in the earnings results of CPG brands that used commodity price increases in 2008 to push through their own price increases.

Another case is two-part pricing – asking customers to pay an upfront fee and then a per-unit price. Examples are the mobile phone activation fees or registration fees charged by services. These upfront fees are nothing but pure profit for the marketer and find customer acceptance when justified with reasons, however trivial, like a processing fee or registration fee. In the B2B case, a marketer can charge an additional upfront price with reasons like customization or order processing.

Just give a reason! – “We are increasing prices otherwise we will go out of business”

I should note that this is a pricing tactic and not a strategy – if your strategy is wrong, any number of fine-tuning tactics, even with reasons, is not going to help.

Footnote: It is a good idea to A/B test your reasons, even though Cialdini and Langer say the specific reason is immaterial.