Base rate my friend, base rate! That said, don’t forget that Nate Silver is a frequentist.
Image from (of course you knew it already): xkcd
In the movie Madagascar, when Alex, Marty, Melman and Gloria land on the island of Madagascar, their conversation goes like this:
"Yeah, here we are."
"Where exactly is here?"
"San Diego."
"San Diego? White sandy beaches, cleverly simulated natural environment, wide open enclosures. I'm telling you, this could be the San Diego zoo. Complete with fake rocks. Wow! That looks real."
A less forgiving view of Melman's behavior is that he started with a preconceived notion and then looked for evidence that supported it, ignoring anything that would contradict it. Once we have made up our minds, our cognitive biases nudge us to talk only to those who would support us and to ask only questions that will add credence to our premise. No wonder he did not consider the fact that they were on a seashore, and that the San Diego zoo does not open onto the sea (not to mention the equatorial climate).
This is the same situation we face when we seek out anecdotes to support our position: we seek what is convenient and available, and we are selective about it.
A more accommodating view would be to assume that his statement that they were in San Diego was a hypothesis, and that he tested it by collecting data. The problems I stated above regarding biases and errors in data collection still apply, so this is not true hypothesis testing.
Same goes for situations when you talk to a few available customers and pick and choose what they say to support your case.
Even if we grant him that the data points he collected were enough and the method was rigorous, there is a problem with this approach. We cannot stop when the data fit one hypothesis; data can fit any number of hypotheses. If our initial hypothesis is way off and not based on any prior knowledge, we will make the same mistake as Melman: looking for ways to turn Madagascar into San Diego.
One way to reduce such errors is to start with better hypotheses – which requires qualitative research and processing prior knowledge. This is the hard part – we get paid for making better hypotheses.
Hypothesis testing, while useful in most scenarios, is still not enough. What Melman found was: given my hypothesis that this is San Diego, how well does the data fit? But what he should have asked is: "given the data, how likely is it that this place is San Diego?" That question cannot be answered by traditional hypothesis testing.
There is another way. For that you need to see the sequel, Escape 2 Africa. In the sequel, similar events happen. This time they crash land in Africa as they fly from Madagascar. They end up asking the same question. This time Melman answers the same, “San Diego”, but quickly adds, “This time I am 70% sure”.
A very generous view of this is that he is applying Bayesian statistics to reduce the uncertainty in his prior knowledge using new data. This requires us to treat the initial notion or hypothesis not as certain but as a premise with uncertainty associated with it. Then we update the premise and its uncertainty as we uncover new data.
But most of the business world is not ready for it yet. If you are interested in hearing more, drop me a note.
Last time I wrote about the use of prior knowledge in A/B testing, there was considerable push back from the analytics community. I think I touched a nerve when I suggested using "how confident you were before the test" to interpret the results after the test. While the use of such information may sound like arbitrary gut feel, we must recognize that we implicitly use considerable prior information in A/B testing. The Bayesian methods I used just made the implicit assumptions explicit.
When you finally get down to testing two (or three) versions with A/B split testing, you have implicitly eliminated many other versions. Stop and ask why you are not testing every possible combination. The answer is that you applied tacit knowledge, based either on your own prior testing or on well-established best practices, and eliminated many versions that required no testing. That is the information prior!
Now let us take this one step further. Of the two versions you selected, make a call on how confident you are that one will perform better than the other. This can be based on prior knowledge about the design elements and user experience, or it can even be a biased estimate. This should not surprise you; after all, we all seem to find reasons why one version performed better than the other after the fact. That after-the-fact scenario suffers from hindsight bias, whereas I am simply asking you to state your prior expectation of which version will perform better.
Note that I am not asking you to predict by how much, only how confident you are that there will be a real (not just statistically significant, but economically significant) difference between the two versions. You should write this down before you start testing, not after (I prefer to think of A/B testing as collecting data). As long as the information is obtained through methods other than the test in question, it is a valid prior. It may not be precise, but it is valid.
What we have is the application of information priors in A/B testing – valid and relevant.
Next up, I will ask you to get rid of the test for statistical significance and look at A/B testing as a means to reduce uncertainty in decision making.
Update 8/14/2011: I wrote this article more than a year ago. There is now a book out on this very topic of Bayesian reasoning. The Times published an article on the book, which gives a very similar coin-toss problem. You can find the solution here.
A street-smart guy called Fat Tony, a by-the-book numbers guy called Dr. John (PhD), and Rev. Thomas Bayes walk into a bar. There they meet a man, Mr. NNT, who shows them a coin and tells them to assume it is fair (equal probability of getting heads or tails). Mr. NNT flips the coin 99 times and gets heads each time. He then asks them,
“What are the chances of getting tails in my next toss?”
While you think about your answer, here is some background on these three people. Fat Tony and Dr. John are two imaginary characters described in the book The Black Swan (pp. 124) by Nassim Nicholas Taleb (NNT), who describes Fat Tony as follows:
Fat Tony's motto is, "Finding who the sucker is." He is remarkably gifted at getting free upgrades and unlisted phone numbers through his forceful charm. (Fat Tony reminds me of Tony Soprano and his methods.)
There is no Thomas Bayes in NNT’s story, he died in the 18th century. His methods, however, are very relevant here.
Back to NNT's question: which of the two answers do you agree with?
Dr. John answers 50%, because that was the assumption and each toss is independent of the others.
Fat Tony answers no more than 1%; he believes the coin must be loaded and it can't be a fair game.
Dr. John follows the science of marketing by numbers to the letter. He applies hypothesis testing, sampling, and statistical significance all the time. But he confuses assumptions with facts. When he starts with an assumption, he refuses to look beyond the obvious and refine his knowledge with new data. He sticks to the prior knowledge as given and dismisses events that say otherwise.
Fat Tony has no system. He shoots from the hip. He is the big-picture visionary guy. He has been there, done that. He has no prior knowledge, nor does he care about analyzing whether data fit theory. He is simply convinced that getting heads 99 times in a row means funny business. He believes the plural of anecdote is data, or worse, irrefutable evidence.
In this specific example, Fat Tony is most likely correct, and he gets it right not because of his superior street skills, gut feel, or "Blink", but because that is how he always makes decisions. He is right to question the assumption, but his methods are not repeatable or teachable.
Until now, I did not say what Thomas Bayes said. His answer was
“I am all but certain (almost 100%) that the next toss will not be tails*”
There is a better way between the street-smart, gut-feel approach of Fat Tony, who goes by just what he sees (and he has seen enough), and the rigid number crunching of Dr. John, who believes that the assumption holds despite the data. That is the Bayesian way.
A Bayesian marketer does not "assume" the very thing he is trying to prove, and does not mistake statistical significance for economic significance. For a Bayesian, assumptions are just that, not irrefutable facts. He does not let his gut decide but lets it guide, and he uses data effectively to make more informed decisions. He knows he is making decisions under uncertainty and wisely uses experimentation and information, combined with his mind, to reduce that uncertainty. (See below for the math.)
That is the way to confidently pursue strategy in the presence of uncertainties!
How do you make your decisions?
*For those mathematically inclined:
A Bayesian does not look at probabilities as ratios of event counts but as measures of certainty. A Bayesian also accounts for uncertainty and does not take the hypothesis as given (i.e., assuming the coin is fair). In this case, instead of stating "the probability of getting heads with a fair coin is 50%", he states his prior as "I am 50% confident I will get tails in a coin toss".
P(C) = 0.5, where C stands for his confidence level in getting tails.
The 99 heads we saw are the data D. The Bayesian asks: given the observed data D, how does my estimate change? That is, what is P(C|D)?
P(C|D) = P(D|C) * P(C) / P(D)
P(D) = P(D|C)*P(C) + P(D|Not C)*P(Not C)
P(D|C) is the chance of getting 99 heads in a row given that the coin is fair. That is (1/2) raised to the power of 99, a very small number.
P(D|Not C), the chance of getting this data when the coin is not fair, is close to 1: a coin sufficiently biased toward heads makes 99 heads in a row unsurprising. Here is a simpler way to see it: the coin can be fair in only one way, but it can be unfair in any number of ways.
Plugging in the numbers, we compute P(C|D) to be close to 0; hence the refined estimate of his confidence.
Hence his answer, which states how confident he is about getting tails in the 100th toss. To reiterate: unlike Fat Tony or Dr. John, the Bayesian does not say what the chances are but how confident he is about the outcome.
For those raising the valid point that the coin could still turn out to be fair if you continued to toss it 10,000 or 10 million times: yes, you are correct, and the Bayesian will continue to refine his uncertainty as new information comes in. He does not stop with the initial information.
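The arithmetic above can be sketched in a few lines of Python. The variable names are mine, and P(D|Not C) is taken to be close to 1, as in the reasoning above:

```python
# Prior: 50% confident the coin is fair (C), per the stated prior P(C) = 0.5.
p_C = 0.5

# P(D|C): chance of 99 heads in a row from a fair coin.
p_D_given_C = 0.5 ** 99

# P(D|Not C): taken to be close to 1, as in the text. A coin heavily
# biased toward heads makes 99 straight heads unsurprising.
p_D_given_not_C = 1.0

# Bayes: P(C|D) = P(D|C) P(C) / [P(D|C) P(C) + P(D|Not C) P(Not C)]
p_D = p_D_given_C * p_C + p_D_given_not_C * (1 - p_C)
p_C_given_D = p_D_given_C * p_C / p_D

print(p_C_given_D)  # on the order of 1e-30: essentially zero confidence in a fair coin
```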
Required Prior Reading: To make the most out of this article you should read these
So you just ran an A/B test between the old version A and a proposed new version B. Your results from 200 observations show version A received 90 and version B received 110. Data analysis says there is no statistically significant difference between the two versions. But you were convinced that version B is better (because of its UI design, your prior knowledge, etc.). So should you give up and not roll out version B?
With A/B testing, it is not enough to find that in your limited sample Version B performed better than Version A. The difference between the two has to be statistically significant at a preset confidence level. (See Hypothesis testing.)
It is not all for naught if you do not find a statistically significant difference between your Version A and Version B. You can still move your analysis forward with some help from an 18th century pastor from England, Thomas Bayes.
Note: What follows is a bit heavy on conditional probabilities. But if you hang in there, it is well worth it, so that you do not throw away a profitable version! You could move from being 60% certain to 90% certain that version B is the way to go!
Before I get to that, let us start with the statistics that form the basis of the A/B test.
With A/B testing you are using the Chi-square test to see if the observed difference in frequencies between the two versions is statistically significant. A more detailed and easy-to-read explanation can be found here. You are in fact starting with two hypotheses:
The null hypothesis H0: The difference between the versions is just random.
The alternative hypothesis H1: The difference is significant, such that one version performs better than the other.
You also choose (arbitrarily) a confidence level, or p-value, at which you want the results to be significant. The most common is the 95% level, or p = 0.05. If your computed Chi-square value is less than the critical value for that p-value (with 1 degree of freedom), you retain H0; if it is greater, you reject H0 and accept H1.
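As a concrete sketch, here is that test applied to the numbers from the opening example (90 vs. 110 out of 200). For 1 degree of freedom the Chi-square p-value has a closed form, so the standard library suffices:

```python
import math

# Observed counts for versions A and B; under H0 we expect an even 100/100 split.
observed = [90, 110]
expected = [100, 100]

# Chi-square statistic: sum of (observed - expected)^2 / expected.
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# For 1 degree of freedom, the p-value is P(X > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi_sq / 2))

print(chi_sq, round(p_value, 3))  # 2.0 0.157 -> not significant at p = 0.05
```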
A/B test results are inconclusive. Now What?
Let us return to the original problem we started with. Your analysis does not have to end just because of this result; you can move it forward by incorporating your prior knowledge, with help from Thomas Bayes. Bayes' theorem lets you find the likelihood that your hypothesis will be true in the future, given the observed data.
Suppose you were 60% confident that version B will perform better. To repeat: we are not saying version B will get 60%; we are saying that your prior knowledge makes you 60% certain version B should perform better (i.e., that the difference is statistically significant). That represents your prior.
Then, with your previously computed Chi-square value, instead of testing whether or not it is significant at a p-value of 0.05, find the p-value at which the Chi-square value is significant and compute 1 - p (Smith and Fletcher).
In the example I used, p is 0.15 and 1 - p is 0.85. According to Bayes, this is the likelihood of observing the data given the hypothesis.
Then the likelihood that your hypothesis will be correct in the future (the posterior) is about 90%:
(.60 × .85) / (.60 × .85 + .40 × .15) ≈ 0.89
From being 60% certain that version B is better, you have moved to being 90% certain that version B is the way to go. You can now decide to go with version B despite the inconclusive A/B test.
If the use of the prior "60% certain that the difference is statistically significant" sounds subjective, it is. That is why we are not stopping there, but improving it with testing. It would help to read my other article on hypothesis testing to understand that there is subjective information in both classical and Bayesian statistics. While with the A/B test we treat the probability of the hypothesis as given (that is, as 1), in Bayesian statistics we assign it a probability.
For the analytically inclined, here are the details behind this.
With your A/B testing you are assuming the hypothesis as given and finding the probability that the data fit the hypothesis. That is a conditional probability.
If P(B) is the probability that version B performs better, then P(B|H) is the conditional probability of B given H. With Bayesian statistics you do not do hypothesis testing. You find the conditional probability that, given the data, the hypothesis makes sense: in this case P(H|B). This is what you are interested in when deciding whether or not to go with version B.
Bayes theorem says
P(H|B) = (Likelihood * Prior )/ P(B)
Likelihood = P(B|H), which we computed as 1 - p (see Smith and Fletcher)
Prior = P(H), your prior knowledge: the 60% certainty we used
P(B) = P(B|H)*P(H) + P(B|H-)*(1-P(H))
P(B|H-) is the probability of B given that the hypothesis is false. In this model it is p, since we are using P(B|H) = 1 - p.
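Putting these pieces together, the hybrid update can be sketched as follows. The numbers are the ones from this article's example; the variable names are mine:

```python
# p-value at which the observed Chi-square value is significant (0.15 in the example).
p = 0.15

likelihood = 1 - p   # P(B|H), per Smith and Fletcher
prior = 0.60         # P(H): prior confidence that version B is better

# P(B) = P(B|H) * P(H) + P(B|H-) * (1 - P(H)), with P(B|H-) = p.
p_B = likelihood * prior + p * (1 - prior)

# Bayes: P(H|B) = Likelihood * Prior / P(B)
posterior = likelihood * prior / p_B

print(round(posterior, 2))  # 0.89: from 60% to roughly 90% confident in version B
```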
This is a hybrid approach, using classical statistics (the p-value at which you found the A/B test to be significant) and Bayesian statistics. It is a simplification: a fully Bayesian treatment is computationally intensive and too much for the task at hand.
The net is that you are taking the A/B test a step forward despite its inconclusive results, and you are able to choose the version that is more likely to succeed.
What is that information worth to you?
References and Case Studies: