Is Your A/B Testing Tool Finding Differences That Aren’t?

You may not know it, but your A/B testing tool may be finding differences that are not really there. The tool may tell you that version 2 performs better than version 1 because it found a statistically significant difference between the performance of the two versions, but in reality there may be no difference. As a result, you may end up investing time and resources in more tests, fine-tuning minor differences.

The problem comes from three fronts:

  1. Definition of performance: Using percentage conversion rates, a choice forced by the next point.
  2. Statistical test used in A/B testing: Using the Student t-test.
  3. Using extremely large samples: Samples larger than 300, a choice forced by the use of the t-test on conversion rates.

A/B testing is about finding whether there is a statistically significant difference, at a preset confidence level (usually 95%), between the performances of the two versions under test. The statistical test used by some of the tools is the Student t-test, and the performance metric compared is the percentage conversion rate.

Let p1 and p2 be the conversion rates of the two versions. If the difference p2-p1 (or vice versa) is found to be statistically significant, we are told version 2 WILL perform better. Worse, some may even conclude version 2 WILL perform 47.5% (or some such number) better based on the math (p2-p1)/p1%.

For the sake of running valid tests, these tools run the tests over long periods of time and collect large amounts of data. Then they run the t-test on the entire data set, typically thousands of data points.

In a paper titled Rethinking Data Analysis, published in the International Journal of Market Research, Vol 52, Issue 1, Prof. Ray Kent writes,

For large samples – certainly samples over 300 or so – any association large enough to attract the attention of the researcher will always be statistically significant.

With large samples we are violating the random sampling requirement for statistical testing. When everything else is held constant, large samples (most of the tests I see use sample sizes upwards of 5,000) increase statistical significance. Differences too small to show up in small samples are magnified in large samples. Large samples have one big problem: they lose all information about segmentation. While you may find no difference between the versions within each segment, put together you will find a statistically significant difference with large samples.
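Here is a minimal sketch of that effect, using made-up conversion rates (the 1.00% and 1.05% figures are hypothetical, not from any real test): hold the rates fixed and the t-stat grows purely because the sample size grows.

```python
# Made-up conversion rates: a 5% relative lift few businesses would act on.
from math import sqrt

p1, p2 = 0.0100, 0.0105

for n in (1_000, 5_000, 50_000, 500_000):      # samples per version
    se = sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)
    t = (p2 - p1) / se
    print(f"n = {n:>7,}: t-stat = {t:.2f}")
# The same fixed difference creeps toward, and eventually past, the 1.645
# one-tailed cutoff simply because n keeps growing.
```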

Imagine you collected 5,300 samples for version 1 and 5,200 for version 2. Let us say the samples include equal numbers of males and females. While you may find no statistically significant difference for males and females separately, you might find one for the total. What if you don't know the hidden segmentation dimensions? What about the hidden demographic and psychographic segmentation dimensions that are not teased out? (See below for detailed math.)

The net is, convinced by the magical words “statistically significant difference”, you end up magnifying differences that are not real differences at all and continue to invest in more and more tests, picking Red arrows over Green arrows.

How can you fix it? Stay tuned for the next in this series.


Here is the A/B test math as practiced by popular tools:

Let us use data published in a 2006 article by Avinash Kaushik (used only for illustrative purposes here, and I believe in that post as well).

Let us start with the hypotheses:

H0: p1 = p2, any difference in conversion rate is due to chance
H1: p2 > p1, the alternative hypothesis (we will be using a one-tailed test)

Then you do the experiment. You send out two offers to potential customers. Here is how the outcomes look:

  • Offer One Responses: n1 = 5,300. Orders: 46. Hence Conversion Rate p1 = 0.87%
  • Offer Two Responses: n2 = 5,200. Orders: 63. Hence Conversion Rate p2 = 1.21%

First we compute the standard error SE for each offer, which is approximated as sqrt( p(1-p)/n ). Note the “n” in the denominator: the higher the “n”, the lower the SE.

In this example, SE1 = 0.001276 and SE2 = 0.001516. Then we compute the common SE between the samples, which is the square root of the sum of the squares of SE1 and SE2. Here SE = 0.00198.

Then we compute the t-stat, (p2-p1)/SE = 1.72. From the t-table, for degrees of freedom = ∞ (more than 120 is infinity for this table), we find the one-tailed value for p-value 0.05 is 1.645. Since 1.72 > 1.645, we declare statistical significance.
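Here is the same computation as a short sketch, simply restating the published numbers (this is not any tool's actual code):

```python
# Reproducing the walkthrough above from the published numbers.
from math import sqrt

n1, orders1 = 5300, 46
n2, orders2 = 5200, 63

p1 = orders1 / n1                      # ~0.87%
p2 = orders2 / n2                      # ~1.21%

se1 = sqrt(p1 * (1 - p1) / n1)         # ~0.00127 (0.001276 above)
se2 = sqrt(p2 * (1 - p2) / n2)         # ~0.00152 (0.001516 above)
se = sqrt(se1**2 + se2**2)             # ~0.00198

t = (p2 - p1) / se                     # ~1.7, the 1.72 above
print(f"t-stat = {t:.2f} vs one-tailed cutoff 1.645")
# Since the t-stat exceeds 1.645, the tool declares statistical significance.
```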

Now let us say that the offers you sent went to two Geos, US and EMEA. Let us assume exactly half the number of each offer was sent to each Geo. Let us also assume that we received exactly equal numbers of responses from each Geo.

Your p1 and p2 remain the same, but your SE1 increases from 0.001276 to 0.001804 and SE2 increases from 0.001516 to 0.002144. So SE increases to 0.0028.

When you do the t-test for US and EMEA separately, the t-stat you compute will be 1.216, less than the 1.645 from the t-table. In other words, there is no statistically significant difference between the two offers for the US, and the same is true for EMEA. But when we put these together, we found otherwise.
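Continuing the sketch, split each offer into two equal Geos with equal responses, per the assumption above, and re-run the same test per Geo:

```python
# Per-Geo version of the same test, under the equal-split assumption above.
from math import sqrt

n1, orders1 = 5300 / 2, 46 / 2         # one Geo's share of offer one
n2, orders2 = 5200 / 2, 63 / 2         # one Geo's share of offer two

p1, p2 = orders1 / n1, orders2 / n2    # rates unchanged: ~0.87% and ~1.21%

se1 = sqrt(p1 * (1 - p1) / n1)         # ~0.0018 (up from 0.001276)
se2 = sqrt(p2 * (1 - p2) / n2)         # ~0.0021 (up from 0.001516)
se = sqrt(se1**2 + se2**2)             # ~0.0028

t = (p2 - p1) / se                     # ~1.2, in line with the 1.216 above
print(f"per-Geo t-stat = {t:.2f}")     # below the 1.645 cutoff
# Neither Geo shows a significant difference on its own; only the pooled test does.
```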

You could counter this by saying we should collect 5,200 samples for each Geo. But what if the segmentation dimensions are not known in advance? What about other demographic and psychographic segmentation?

Large samples will find statistically significant differences that are in reality not significant at all.

Another mistake is to quote the % difference between versions. It is just wrong to say one version performed better than the other by x%. Note that the alternative hypothesis is p2 > p1 and it DOES NOT say anything about by how much. So when we find a statistically significant difference, we reject H0, but there is nothing in our hypothesis or the method to say p2 performed better than p1 by x%!
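As an illustrative aside that goes beyond what the tools described above report, putting a 95% interval around the observed difference from the earlier example shows how little the test itself says about magnitude:

```python
# An illustrative aside: a 95% confidence interval around the observed
# difference from the example above (not something the tools report).
from math import sqrt

n1, p1 = 5300, 46 / 5300
n2, p2 = 5200, 63 / 5200

diff = p2 - p1
se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

lo, hi = diff - 1.96 * se, diff + 1.96 * se
print(f"observed relative lift: {diff / p1:.0%}")                    # ~40%
print(f"95% interval for the lift: {lo / p1:.0%} to {hi / p1:.0%}")  # ~ -5% to ~84%
# Quoting the single ~40% number hides a range this wide, which is why
# "version 2 performs x% better" overstates what the test established.
```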

    The A/B Test Is Inconclusive. Now What?

Required Prior Reading: To make the most of this article, you should read these:

    1. Who makes the hypothesis in hypothesis testing?
    2. Solution to NYTimes coin toss puzzle

So you just did an A/B test between the old version A and a proposed new version B. Your results from 200 observations show version A received 90 and version B received 110. Data analysis says there is no statistically significant difference between the two versions. But you were convinced that version B is better (because of its UI design, your prior knowledge, etc.). So should you give up and not roll out version B?

With A/B testing, it is not enough to find that in your limited sampling Version B performed better than Version A. The difference between the two has to be statistically significant at a preset confidence level. (See hypothesis testing.)

It is not all for naught if you do not find a statistically significant difference between your Version A and Version B. You can still move your analysis forward with some help from an 18th-century pastor from England – Thomas Bayes.

Note: What follows is a bit heavy on the use of conditional probabilities. But if you hang in there, it is well worth it, so you do not throw away a profitable version! You could move from being 60% certain to 90% certain that version B is the way to go!

Before I get to that, let us start with the statistics that form the basis of the A/B test.

With A/B testing you are using the Chi-square test to see if the observed frequency difference between the two versions is statistically significant. A more detailed and easy-to-read explanation can be found here. You are in fact starting with two hypotheses:

The null hypothesis H0: The difference between the versions is just random.
The alternative hypothesis H1: The difference is significant, such that one version performs better than the other.

You also choose (arbitrarily) a confidence level, or p-value, at which you want the results to be significant. The most common is the 95% level, or p = 0.05. Based on whether your computed Chi-square value is greater or less than the corresponding table value for 1 degree of freedom, you reject H0 and accept H1, or retain H0.
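Here is a minimal sketch of that step, using the 200-observation example above (90 for version A, 110 for version B) and assuming the traffic was split evenly:

```python
# Chi-square test for the 90 vs 110 split described above.
observed = [90, 110]
expected = [100, 100]     # under H0 the 200 observations split 50/50

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))  # = 2.0
print(f"chi-square = {chi_sq:.1f}")
# The 5% cutoff for 1 degree of freedom is 3.84, so 2.0 falls short:
# no statistically significant difference, i.e., the test is inconclusive.
```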

    A/B test results are inconclusive. Now What?
Let us return to the original problem we started with. Your analysis does not have to end just because of this result; you can move it forward by incorporating your prior knowledge, with help from Thomas Bayes. Bayes' theorem lets you find the likelihood that your hypothesis will be true in the future given the observed data.
Suppose you were 60% confident that version B will perform better. To repeat, we are not saying version B will get 60%; we are stating that your prior knowledge says you are 60% certain version B should perform better (i.e., the difference is statistically significant). That represents your prior.

Then, with your previously computed Chi-square value, instead of testing whether or not it is significant at the p-value 0.05, find at what p-value the Chi-square value is significant and compute 1-p (Smith and Fletcher).

In the example I used, p is 0.15 and 1-p is 0.85. According to Bayes, this is the likelihood of the observed data given the hypothesis.

Then the likelihood your hypothesis will be correct in the future (the posterior) is roughly 90%:
(.60*.85)/(.60*.85 + .40*.15) ≈ 0.89
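Here is the same update as a short sketch, following Smith and Fletcher's shortcut of taking the likelihood as 1-p (the exact p for a chi-square of 2.0 with 1 degree of freedom is about 0.157, the ~0.15 used above):

```python
# Bayesian update described above, with likelihood = 1 - p.
p = 0.157                 # e.g., 1 - scipy.stats.chi2.cdf(2.0, df=1); ~0.15 above
prior = 0.60              # P(H): 60% certain, before the test, that B is better
likelihood = 1 - p        # P(B|H)

evidence = likelihood * prior + p * (1 - prior)   # P(B) = P(B|H)P(H) + P(B|H-)(1-P(H))
posterior = likelihood * prior / evidence         # P(H|B)

print(f"posterior = {posterior:.0%}")             # ~89%, the "roughly 90%" above
```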

From being 60% certain that version B is better, you have moved to being roughly 90% certain that version B is the way to go. You can now decide to go with version B despite the inconclusive A/B test.

If the use of the prior “60% certain that the difference is statistically significant” sounds subjective, it is. That is why we are not stopping there and are improving it with testing. It would help to read my other article on hypothesis testing to understand that there is subjective information in both classical and Bayesian statistics. While in the A/B test we treat the probability of the hypothesis (that we set) as 1, in the Bayesian approach we assign it a probability.

    For the analytically inclined, here are the details behind this.

With your A/B testing you are assuming the hypothesis as given and finding the probability that the data fit the hypothesis. That is a conditional probability.

If P(B) is the probability that version B performs better, then P(B|H) is the conditional probability that B occurs given H. With Bayesian statistics you do not do hypothesis testing; you find the conditional probability that, given the data, the hypothesis makes sense. In this case it is P(H|B). This is what you are interested in when deciding whether or not to go with version B.

Bayes' theorem says

P(H|B) = (Likelihood * Prior) / P(B)
Likelihood = P(B|H), what we computed as 1-p (see Smith and Fletcher)
Prior = P(H), your prior knowledge, the 60% certainty we used
P(B) = P(B|H)*P(H) + P(B|H-)*(1-P(H))
P(B|H-) is the probability of B given the hypothesis is false. In this model it is p, since we are using P(B|H) = 1-p.

This is a hybrid approach, using classical statistics (the p-value at which you found the A/B test to be significant) and Bayesian statistics. It is a simplification of a fully Bayesian analysis, which is computationally intensive and too much for the task at hand.

    The net is you are taking the A/B test a step forward despite its inconclusive results and are able to choose the version that is more likely to succeed.

    What is that information worth to you?

    References and Case Studies:

1. Wagenmakers, E.-J., Lodewyckx, T., Kuriyal, H. and Grasman, R. (2010) Bayesian hypothesis testing for psychologists: A tutorial on the Savage–Dickey method. Cognitive Psychology, Vol 60, Issue 3, pp 158-189
2. Smith, D. V. L. and Fletcher, J. H. (2004) The Art & Science of Interpreting Market Research Evidence
3. Lilford, R. (2003) Reconciling the quantitative and qualitative traditions: the Bayesian approach. Public Money and Management, Vol 23, Issue 5, pp 273-284
4. Retzer, J. (2006) The century of Bayes. International Journal of Market Research, Vol 48, Issue 1, pp 49-59

    Metric For Predicting Profitability

Recently, I analyzed the past 10 years' worth of 10-K filings of S&P 500 companies. A very interesting finding from this analysis is a high positive correlation (0.673) between the number of times the word “customer” occurs in the 10-K and profit growth. Every 20% increase in the word “customer” was associated with a 3.75% increase in profit. This finding adds to mounting evidence that companies that are customer focused stand to reap the benefits while those that are not are cast aside.

No wait. If you have not guessed already, I made all this up. But let us pretend otherwise and dissect this for analysis errors.

Correlation does not mean causation: I imply causation with my statement about the 20% increase. Yes, there may be correlation, but it means nothing. It could be due to any number of reasons (lurking variables and omitted variable bias). You can find many other correlations if you look for them.

    Meaningless metric: I nudge you to think that being customer focused is manifested in the word count.

Cross-Sectional Analysis: This means I looked across companies and found that those with low word counts are associated with low profit growth and those with high word counts are associated with high profit growth. This ignores all industry-specific and firm-specific factors.

Implying Longitudinal Analysis: Longitudinal analysis is following a firm's performance over the years and seeing if the correlation holds true for a single firm as it increased or decreased the word count. However, by stating “10 years” I imply that I did just such an analysis.

Survivorship Bias: The imaginary data set consists only of those companies that are publicly traded and survived for 10 years, and I only looked at S&P 500 companies, which are part of the S&P 500 because of specific selection criteria. Half of the companies that were in the S&P 500 ten years ago are not in it any more, so the sample set is even smaller. What about all the other companies that are either private, did not make it, or got dropped from the S&P 500?

    You probably did not worry about all these errors because the initial claim is so ridiculous that no further dissection was necessary. But not all claims on customer metrics are this obviously ridiculous.

    There lies the danger!

These usually come neatly packaged and branded, come from someone with “authority”, are stated with the backing of data, and are vouched for by marquee customers. They become extremely popular – accepted as gospel by other marketing gurus and blogs. Authority and acceptance become a stand-in for truth (Kenneth Galbraith).

    But these claims are susceptible to exactly the same errors I stated above. I urge you to look beyond the surface and fanfare and look at the biases before you embrace the next big metric.

    Making a Case for Practicing Evidence Based Management

I am repurposing Pascal's wager in making a case for evidence based management over intuition, gut feel, “blink”, fads and conventional wisdom.

The net is: in the presence of uncertainties (I treat this as a truism), the dominant strategy for a decision maker is to rely on hard evidence, experiments and analytics. Nothing more to add to this argument beyond what Pascal already said.


    Implying Causation – Predictive Analytics Slippery Slope

Imagine, if you will, a child eating broccoli for the very first time. While eating the broccoli, let us say the child sneezes a few times in succession and then proudly declares, “I think I am allergic to broccoli”. As a parent, or simply as a grown-up, it is not difficult for you to see the fallacy in the child's case. One does not need an advanced degree in econometrics or statistics to reply, “eat your broccoli – correlation does not imply causation”. Consider the following real cases:

1. From The Times Economix blog: “There's a very strong positive correlation between income and test scores. (For the math geeks out there, the R2 for each test average/income range chart is about 0.95.)”

2. From The WSJ opinion column: “Study after study reveals that there are long-term career benefits to working as a teenager and that these benefits go well beyond the pay that these youths receive. A study by researchers at Stanford found that those who do not work as teenagers have lower long-term wages and employability even after 10 years.”

3. From WSJ half-page ads targeting parents: “Students who read The Journal are 76% more likely to have a GPA of 3.6 or higher.”

4. From a research paper on subscription to library resources by universities: “Working with Dr. Carol Tenopir of the University of Tennessee and consultant Judy Luther of Information Strategies, this single-case study demonstrates a $4.38 grant income for each $1.00 invested by the university in the library (ROI Value). The white paper University Investments in Information: What's the Return? is posted on Library Connect. The results articulate the relationship between the value of research information and its impact on the funding of an institute.”

5. From a research paper from the London School of Economics: “In terms of percentage growth, a 7 point increase in word of mouth advocacy (net-promoter score) correlated with a 1% increase in growth (1 point increase = .147% more growth). The measurement was done through a telephone survey in 2005 and the revenue growth numbers are for 2003-2004.”

Can you spot the fallacies in these claims? Are these seemingly erudite and well-researched claims any different from the claims of a smart child that wants to avoid broccoli? Why do we want to see correlation when none exists, or take correlation for causation? Why do we suspend our critical thinking when the results are presented by big brands and big universities and packed with tonnes of data and graphs?

Of all the cases I listed above, the last one is the winner. Suppose that, in the chronology of events, event-2 follows event-1 in time. It is a pardonable and ubiquitous mistake when someone says event-1 might have caused event-2. This is the garden-variety correlation-causation confusion. But the example I quote says, in effect, “event-2 caused event-1”.

    I do not know a word for this!

    True Cost Of New Sales

    Consider the two major “marketing promotions” by the US Government:

1. The Cash for Clunkers program helped sell new cars and spur consumer spending at a time when consumer confidence was down and people were tightening their belts.
2. The new Home Buyer tax credit of $8,000 helped sell houses at a time when foreclosures were increasing and sales were slowing.

These are not that much different from marketing campaigns or promotions run by any business. The Government did it to increase economic activity, and businesses do it to increase sales or fill their pipelines.

One mistake that is common in both cases is attributing all the sales during the promotional period to the campaign. Marketers need to correctly identify the truly incremental sales (those above and beyond what would have happened without the campaign). This tells you the true cost of incremental sales and whether or not the campaign delivers profitable sales.

Take the Home Buyer tax credit program. It is an $8,000 tax credit per buyer. But according to Ted Gayer of the Brookings Institution, if you consider only the truly incremental sales, the cost to the Government is $43,000 per buyer. If a campaign is not targeted and selective, it is going to cost a lot more for each new customer acquired.
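As a rough illustration of the arithmetic, with hypothetical buyer counts and an assumed incremental share chosen only to reproduce the gap between the $8,000 credit and the $43,000 figure:

```python
# Hypothetical numbers, chosen only to illustrate the $8,000-per-buyer vs
# $43,000-per-incremental-buyer gap quoted above.
credit_per_buyer = 8_000
buyers_claiming_credit = 1_000_000     # assumed: every buyer claims the credit
incremental_share = 0.186              # assumed: fewer than 1 in 5 would not have bought anyway

total_cost = credit_per_buyer * buyers_claiming_credit
incremental_buyers = buyers_claiming_credit * incremental_share

cost_per_incremental_buyer = total_cost / incremental_buyers
print(f"${cost_per_incremental_buyer:,.0f} per truly incremental buyer")  # ~$43,000
# Attributing every sale to the campaign makes it look like $8,000 per sale;
# dividing the total spend by incremental sales alone reveals the true cost.
```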

Do you know your customer segments and how to target them? Do you have the analytics in place to know your campaign effectiveness and the cost of new sales?