8 Flaws in A/B Split Testing

You have been using A/B split testing to improve your mail campaigns and web designs. The core idea is to randomly assign participants to group A or B and measure the resulting performance – usually in terms of conversion. You then run a statistical test, either a t-test (incorrect) or a Chi-square test, to see whether the difference in performance between A and B is statistically significant at the 95% confidence level.

There are significant flaws with this approach:

  1. Large Samples: Large samples are very likely to show statistical significance even for small differences. When using large samples (larger than 300) you lose segmentation differences.
  2. Focus on Statistical Significance: Every test tool, sample size calculator and article is narrowly focused on achieving statistical significance, treating that as the final word on the superiority of one version over the other.
  3. Ignoring Economic Significance: There may be statistical significance or not, but no test tool will tell you the economic significance of that for your decision making.
  4. Misleading Metrics: When tools report that version A is X% better than version B, they are simply wrong. The hypothesis tested in A/B testing is simply that one version is better than the other, not by what percent.
  5. All or nothing: When the test results are inconclusive, there is nothing to learn from these tests.
  6. Discontinuous: There is no carryover of knowledge gained from previous tests. We do not apply any knowledge gained from a test in later tests.
  7. Test Everything and Test Often: The method wrests control from the decision maker in the name of being “data driven”. This pushes one to suspend all prior knowledge (because it is dismissed as hunches and intuition) and to test everything and test often, resulting in significant costs for minor improvements. Realize that the test tool makers have an incentive to keep you testing regularly and excessively.
  8. Mistaking X implies Y is same as Y implies X: The hypothesis testing is flawed. What we test is, “how well does the data fit the hypothesis that we assumed”. But at the end of the test we state, “the hypothesis is supported by the data and is true for all future data”.
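
The first flaw can be illustrated with a small sketch. The conversion rates and sample sizes below are hypothetical, and the pooled two-proportion z-test is used as a stand-in for the Chi-square test on a 2x2 table (the two agree when no continuity correction is applied). The same small lift goes from unremarkable to "significant" purely because the sample grows.

```python
import math

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a difference in conversion rates (pooled z-test)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided tail area of the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical numbers: version A converts at 2.0%, version B at 2.1%.
for n in (1_000, 100_000, 1_000_000):
    conv_a = round(0.020 * n)
    conv_b = round(0.021 * n)
    p = two_proportion_p_value(conv_a, n, conv_b, n)
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"n per group = {n:>9,}: p = {p:.4f} ({verdict} at 95%)")
```

Nothing about the two versions changes between the rows; only the sample size does. Whether a 0.1 point lift matters to the business is a separate question that the p-value does not answer.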

The root cause of all the mistakes is in using A/B testing for decision making. When you are deciding between two versions you are deciding which option will deliver you better returns. The uncertainty is in deciding the version. If there is no uncertainty at all, why bother?

The way to reduce uncertainty is to collect relevant information. It is profitable to do so only if the cost to collect this information is less than the expected increase in return from reducing the uncertainty.
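
One way to put a number on that trade-off is the expected value of perfect information, a standard decision-analysis quantity that I am swapping in here as an illustration. The scenario probabilities, payoffs and test cost below are all hypothetical; the point is only the comparison between what the information could be worth and what it costs.

```python
def expected_value_of_perfect_information(scenarios):
    """scenarios: list of (probability, payoff_a, payoff_b) tuples."""
    # Best we can do today: commit to the option with the higher expected payoff.
    ev_a = sum(p * a for p, a, _ in scenarios)
    ev_b = sum(p * b for p, _, b in scenarios)
    value_without_info = max(ev_a, ev_b)
    # With perfect information we would pick the better option in each scenario.
    value_with_info = sum(p * max(a, b) for p, a, b in scenarios)
    return value_with_info - value_without_info

# Hypothetical prior beliefs about the return from versions A and B.
scenarios = [
    (0.6, 100_000, 105_000),  # 60% chance B beats A by 5,000
    (0.4, 100_000, 98_000),   # 40% chance A beats B by 2,000
]
test_cost = 3_000  # hypothetical cost of running the test

evpi = expected_value_of_perfect_information(scenarios)
print(f"Most the information can be worth: {evpi:,.0f}")
print("Worth testing" if test_cost < evpi else "Not worth testing: just pick the better bet")
```

Even a perfect test cannot be worth more than this figure, so a real, imperfect test that costs more than it is not worth running.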

You are not in the hypothesis testing business. You are in the business of adding value to your shareholders (that is you, your investors). To deliver value you need to make decisions in the presence of uncertainties.  With all its flaws, A/B testing is not the right solution for decision making!

So stop using A/B testing!

What do I recommend? Send me a note to read a preview of my article on “Iterative Bayesian (TM)”.

Let us hunt for something interesting in this data gold mine

How many times have you heard this?

We are collecting a lot of data on our customers/transactions/sales/logs,  let us look at this goldmine to see if we can find anything interesting .

The problem with seeking something interesting is you are bound to find it. You might call the next statement tautological, but the fact is if it is interesting you are bound to find it. To a determined data-miner an interesting statistical outlier will eventually show up, after which it is simply a matter of writing the hypothesis to match.
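
A rough sketch of why this happens, under assumptions of my own: many customer segments are compared, none of them has any real difference between groups A and B, and each comparison uses a pooled two-proportion z-test at the 95% level. The segment count, sample size and conversion rate are hypothetical.

```python
import math
import random

def chance_of_false_discovery(num_comparisons, alpha=0.05):
    """Probability that at least one of several independent no-effect comparisons looks significant."""
    return 1 - (1 - alpha) ** num_comparisons

def count_spurious_findings(num_segments, n_per_group, true_rate, seed=0):
    """Simulate segments where A and B share the same true rate; count 'significant' ones."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_segments):
        conv_a = sum(rng.random() < true_rate for _ in range(n_per_group))
        conv_b = sum(rng.random() < true_rate for _ in range(n_per_group))
        pooled = (conv_a + conv_b) / (2 * n_per_group)
        se = math.sqrt(pooled * (1 - pooled) * 2 / n_per_group)
        if se == 0:
            continue  # no conversions at all in this segment, nothing to test
        z = ((conv_b - conv_a) / n_per_group) / se
        if math.erfc(abs(z) / math.sqrt(2)) < 0.05:
            hits += 1
    return hits

print(f"Chance of at least one 'finding' in 100 looks: {chance_of_false_discovery(100):.3f}")
print("Spurious findings in 100 simulated segments:",
      count_spurious_findings(num_segments=100, n_per_group=1000, true_rate=0.05))
```

Every "finding" the simulation reports is noise by construction, yet each one would pass the usual significance check if it were the only thing examined.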

Data mining, or as some might call it data trolling, is looking for patterns in data that is sitting around, as opposed to deliberate decision making, which requires seeking specific information to reduce uncertainty. But data can fit any number of hypotheses. When mining for a cause, we are bound to pick

  • Ones that are most convenient, like the man searching for lost key under the light.
  • Ones that are familiar, based on our past experience and our beliefs – well there are many fables about this.

The way to make informed decisions is to frame a hypothesis based on the best prior knowledge we have. Know that this is just a hypothesis, not a fact, and that it has uncertainties associated with it. Then collect specific data to refine it and reduce the uncertainty.

We will never know all the facts with certainty, but if we realize that what we know has uncertainty associated with it and there could be far more that we do not yet know, we are on the right track.

How do you make your decisions?

The Role of Information is to Reduce Uncertainty

Why do we need to do Marketing research, collect analytics,  perform A/B testing, and conduct experiments?

  1. To find out whether the highest paid person’s opinion (HiPPO) is true?
  2. To pick the clear winning option?
  3. To satisfy our ego that we drive decisions based on analytics?

The real purpose of  all these methods of data collection is to reduce uncertainty in our decision making. Decision making after all is about making choices. If there are no choices or you have already made your choice, then there is no real decision making.

If you have options but are not certain which one to go with, then there is uncertainty, or more precisely there is an unacceptable level of uncertainty. If it were acceptable, that is, the expected results are not that different, then there is no decision making either. Just flip a coin and go with it (My article from 2009.)

If the level of uncertainty is unacceptable,  that is choosing the wrong option will mean the difference between life and death or profit and loss – then it may be worth it to reduce this uncertainty provided the cost to get this information is less than the value differential.

Conversely, if the information you have or collect does nothing to reduce uncertainty in decision making then it is irrelevant regardless of how plentiful it is, how statistically significant it is, and how easy or cheap it is to collect it.

How do you make your decisions?

Why do you collect information?

Looking Beyond the Obvious – Gut, Mind and in Between

Update 8/14/2011: I wrote this article more than a year ago. There is a book out on the very topic of Bayesian reasoning.  Times published an article on the book. The article gives a very similar coin-toss problem. You can find the solution here.

A street smart guy called Fat Tony, a by-the-book numbers guy called Dr. John (PhD) and Rev. Thomas Bayes walk into a bar. There they meet a man, Mr. NNT, who shows them a coin and tells them to assume it is fair (equal probability of getting heads or tails). Mr. NNT flips the coin 99 times and gets heads each time. He then asks them,

“What are the chances of getting tails in my next toss?”

While you think about your answer, here is some background on these three people. Fat Tony and Dr. John are two imaginary people described in the book The Black Swan (p. 124) by Nassim Nicholas Taleb (NNT), who describes them as follows:

Fat Tony’s motto is, “Finding who the sucker is”. He is remarkably gifted at getting free upgrades and unlisted phone numbers through his forceful charm. (Fat Tony reminds me of Soprano and his methods)

Dr. John is a painstaking, reasoned and gentle fellow who knows computers and statistics and works for an insurance company. (Dr. John reminds me of Data, with rigid rules)

There is no Thomas Bayes in NNT’s story, he died in the 18th century. His methods, however, are very relevant here.

Back to NNT’s question, which one of the two answers will you agree with?
Dr. John answers  50% because that was the assumption and each toss is independent of the other.
Fat Tony answers no more than 1% and believes the coin must be loaded and it can’t be a fair game.

Dr. John follows the science of marketing by numbers to the letter. He applies hypothesis testing, sampling and  statistical significance all the time. But, he confuses assumptions with facts. When he starts with an assumption he refuses to look beyond the obvious and refine his knowledge with new data. He sticks to the prior knowledge as given and dismisses events stating otherwise.

Fat Tony has no system. He shoots from the hip. He is the big picture visionary guy. He has been there, done that. He has no prior knowledge nor does he care about analyzing whether data fits theory. He is simply convinced that getting heads 99 times in a row means funny business. He believes the plural of anecdotes is data, or worse, irrefutable evidence.

In this specific example, Fat Tony is most likely to be correct and he gets it right  not because of his superior street skills, gut feel, “Blink” but because of how he always makes decisions. He is correct in questioning the assumption but his methods are not repeatable or teachable.

Until now, I did not say what Thomas Bayes said. His answer was

“I am all but certain (almost 100%)  that the next toss will not be tails*”

There is a better way between the street smart, gut-feel ways of Fat Tony, who goes by just what he sees (and he has seen enough), and the rigid number crunching of Dr. John, who believes that the assumption holds despite data. That’s the Bayesian way.

The Bayesian marketer does not “assume” the very thing he is trying to prove and does not mistake statistical significance for economic significance. For a Bayesian, assumptions are just that, not irrefutable facts. He does not let the gut decide but lets it guide, and effectively uses data to make more informed decisions. He knows he is making decisions under uncertainty and wisely uses experimentation and information combined with his mind to reduce this uncertainty. (see below for the math)

That is the way to confidently pursue strategy in the presence of uncertainties!

How do you make your  decisions?


*For those mathematically inclined:

A Bayesian does not look at probabilities as a ratio of counts of events but as a measure of certainty. A Bayesian also accounts for uncertainty and does not take the hypothesis as given (i.e., assuming the coin is fair). In this case instead of stating, “the probability of getting heads with a fair coin is 50%”, he states his prior as, “I am 50% confident I will get tails in a coin toss”

P(C) = 0.5   where C stands for the coin being fair, the assumption behind his confidence level in getting tails.
The 99 heads we saw are the data D. Bayesian asks given the observed data D, how does my estimate change. That is  he asks what is P(C|D)?

P(C|D) =  P(D|C) * P(C) / P(D)

P(D) = P(D|C)*P(C) + P(D|Not C)*P(Not C)
P(D|C) is the chance of getting 99 heads in a row given that the coin was fair. That is (1/2) raised to the power of 99, a very low number.
P(D|Not C) is the chance of getting this data when the coin is not fair. It is not simply 1 - P(D|C); it depends on how the coin is loaded, and for a coin loaded heavily toward heads it is close to 1. Here is a simpler explanation: the coin can be fair in only one way but it can be unfair in any number of ways.

Plugging in the numbers we compute  P(C|D) to be close to 0 and hence the refined estimate of confidence.

Hence his answer, which says how confident he is about getting tails in the 100th toss. To reiterate, unlike Fat Tony or Dr. John, the Bayesian does not say what the chances are but how confident he is about the outcome.

Some will raise the valid point that the coin could still turn out to be fair if you continue to toss it 10,000 or 10 million times. Yes, you are correct, and the Bayesian will continue to refine his uncertainty as new information comes in. He does not stop with the initial information.
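
For readers who want to plug in the numbers, here is a minimal sketch of the update. It follows the notation above (C is the coin being fair, D is the run of heads) and adds one assumption of my own: the only alternative to a fair coin is a loaded coin whose chance of heads is `p_heads_if_loaded`, set to 1.0 by default, which makes P(D|Not C) exactly 1. The function names are hypothetical.

```python
def posterior_fair(prior_fair, heads_in_a_row, p_heads_if_loaded=1.0):
    """P(C|D): confidence that the coin is fair after seeing a run of heads."""
    likelihood_fair = 0.5 ** heads_in_a_row                   # P(D|C)
    likelihood_loaded = p_heads_if_loaded ** heads_in_a_row   # P(D|Not C), assumed alternative
    evidence = (likelihood_fair * prior_fair
                + likelihood_loaded * (1 - prior_fair))       # P(D)
    return likelihood_fair * prior_fair / evidence

def chance_of_tails(prior_fair, heads_in_a_row, p_heads_if_loaded=1.0):
    """Confidence in tails on the next toss, averaged over both kinds of coin."""
    p_fair = posterior_fair(prior_fair, heads_in_a_row, p_heads_if_loaded)
    return p_fair * 0.5 + (1 - p_fair) * (1 - p_heads_if_loaded)

# The estimate is refined as each batch of tosses comes in.
for heads in (1, 5, 10, 99):
    print(f"after {heads:>2} heads: P(fair) = {posterior_fair(0.5, heads):.3g}, "
          f"P(tails next) = {chance_of_tails(0.5, heads):.3g}")
```

Changing `p_heads_if_loaded` to something less extreme, such as 0.9, changes the exact figures but not the conclusion after 99 heads: the confidence in a fair coin, and with it the confidence in tails, collapses toward zero.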

Numbers Coming Back to Haunt

Why is there so much indifference and/or push-back to using numbers for driving decisions?

Why are experience and intuitions preferred over analytics?

Why are the subject matter experts trusted over hard data?

Why are the proponents of data and analytics not only ignored but have their motives questioned?

The answer to all these questions is: Self interest.

Here is a quote from the book  Super Crunchers that nails it:

Traditional experts don’t like the loss of control and status that accompanies a shift toward Super Crunching (Evidence based management). But some of the resistance is more visceral. Some people fear numbers. For these people, Super Crunching is their worst nightmare. To them, the spread of data-driven decision making is just the kind of thing they thought they could avoid by majoring in the humanities and then studying something nice and verbal, like law.

The greater its [data driven decision making] impact the greater is the resistance.

Bye Bye Mr.Memory! Hello Mr. Insight

In the Alfred Hitchcock film The 39 Steps, the opening scene features a stand-up act by a man introduced to us as Mr. Memory. People paid money to come to this show. He was someone who had committed at least 50 facts per day to memory and could answer any audience question. The questions range from the distance between Winnipeg and Montreal to baseball statistics. Today we do not need Mr. Memory, nor do we appreciate committing facts to memory. We have Google, or Bing, or the next big search engine.

If you look closely at Mr.Memory’s act it was still an entertainment act. If it was a rote regurgitation of facts the audience would not have paid good money to get there. He was witty and the audience was laughing.  Mr.Memory would have bombed if the audience were bored or laughing at him instead of at his jokes.

Google or not, data, information and facts have always been available to those who sought them. Data might not have been free, or there may have been transaction costs, but it was available. People protected data as if the value were intrinsic to the data. Value is not intrinsic to data. Value is created from the insights one derives from it to serve a market and gain the upper hand over the competition.

There is a quote that was attributed to Sam Walton (I cannot verify the authenticity): “I am not so much afraid of someone stealing my data as someone can make better decisions with it than I can”.  Whether or not Sam Walton said this the statement holds true.

Mr. Memory may not have a job today, but he knew then that his advantage came from doing something different with the information – delivering entertainment that competed for customer wallet share against other forms of entertainment.

Do you, as a decision maker, just seek data for its own sake or create actionable insights that deliver profits?