Tag Archives: Statistical Significance

Testing 40 shades of blue – AB Testing

The title refers to the famous anecdote about Marissa Mayer testing 40 shades of blue to determine the right color for the links. (Unfortunately I am colorblind, I know just one blue.)

Mayer is famous for many things at Google, but the one that always sticks out – and defines her in some ways – is the “Forty Shades of Blue” episode.

she ordered that 40 different shades of blue would be randomly shown to each 2.5% of visitors; Google would note which colour earned more clicks. And that was how the blue colour you see in Google Mail and on the Google page was chosen.

Thousands of such tests happen in the web world, with many websites running multiple experiments a day. Contrary to what most in webapp development may believe, AB testing does not have its origins in the webapp world. It is simply an application of statistical testing, the Randomized Controlled Trial, to determine whether a ‘treatment’ made a difference in the performance of the treatment group compared to the performance of the control group.

The simplest test checks whether the observed difference between the two sample means is statistically significant. That means computing the p-value: the probability of seeing a difference at least as large as the observed one if the treatment had no effect and only random variation were at work. If the p-value is less than a preset level, we declare that the treatment made a difference.
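To make the idea concrete, here is a minimal sketch of such a test for click-through rates, using a two-proportion z-test. The visitor and click counts are made up for illustration, and the function name is my own; treat it as one reasonable way to run the comparison, not the method any particular site uses.

```python
import math

def two_proportion_z_test(clicks_a, visitors_a, clicks_b, visitors_b):
    """Two-sided z-test for the difference between two click-through rates."""
    rate_a = clicks_a / visitors_a
    rate_b = clicks_b / visitors_b
    # Pooled rate under the null hypothesis that both variants share one true rate.
    pooled = (clicks_a + clicks_b) / (visitors_a + visitors_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Made-up counts, for illustration only.
z, p_value = two_proportion_z_test(clicks_a=200, visitors_a=10_000,
                                   clicks_b=250, visitors_b=10_000)
alpha = 0.05  # the preset significance level
print(f"z = {z:.2f}, p-value = {p_value:.4f}, significant: {p_value < alpha}")
```

The normal approximation used here is reasonable for large visitor counts like these; with small samples an exact test would be the safer choice.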

Does it matter if the results are statistically significant? See here why it does not:

“I have published about 800 papers in peer-reviewed journals and every single one of them stands and falls with the p-value. And now here I find a p-value of 0.0001, and this is, to my way of thinking, a completely nonsensical relation.”

Should you test 40 shades of blue to find the one that produces most click-thrus or conversions? xkcd has the answer:
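The comic is not reproduced in this text. One statistical concern with running 40 comparisons at once, offered here as my assumption about the point being made, is the multiple-comparisons problem: test enough variants and some will look significant by luck. The sketch below assumes independent tests at a 5% level and shows the Bonferroni correction as one standard, conservative remedy.

```python
alpha = 0.05      # per-test significance level
num_tests = 40    # one test per shade of blue

# Chance that at least one test comes up "significant" by luck alone,
# assuming the tests are independent and no shade truly differs.
family_wise_error = 1 - (1 - alpha) ** num_tests

# Bonferroni correction: a stricter per-test threshold that keeps the
# overall false-positive chance at or below alpha.
bonferroni_alpha = alpha / num_tests

print(f"Chance of at least one false positive: {family_wise_error:.1%}")
print(f"Bonferroni per-test threshold: {bonferroni_alpha:.5f}")
```

Under these assumptions the chance of at least one spurious winner among 40 shades is roughly 87%, which is why a single "winning" shade deserves a follow-up test before anyone repaints the links.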

Can Ms. Mayer test her way out of Yahoo’s current condition? Remember, all this split testing is about finding low-hanging fruit, not quantum leaps. And as Jim Manzi wrote in his book Uncontrolled,

Perhaps the single most important lesson I learned in commercial experimentation, and that I have since seen reinforced in one social science discipline after another, is that there is no magic. I mean this in a couple of senses. First, we are unlikely to discover some social intervention that is the moral equivalent of polio vaccine. There are probably very few such silver bullets out there to be found. And second, experimental science in these fields creates only marginal improvements. A failing company with a poor strategy cannot blindly experiment its way to success …

You can’t make up for poor strategy with incessant experimentation.

Making up things and supporting with faulty analysis

Update 7/11/202: I took a harsh stance against Mr. Jamison’s article. Now that I have communicated with him over email and seen his willingness to share data and refine his model, I see my comments as a little harsh. Instead of updating them I will leave them as they are so you can also judge my writing.

How do VCs decide to pass on a startup? If you were to read a TechCrunch article you would find a quantitative model supported by statistical analysis:

Likelihood of Receiving Term Sheet = -0.355  +
0.349 (Team) +
0.334 (Market) +
0.222 (Traction) +
0.029 (Product)

A nice linear regression model with an R² value of 0.5 that states the Likelihood of getting a Term Sheet as a function of four attributes. This article and the regression model come from a partner in a VC firm, Mr. Jay Jamison.
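To see what the equation actually produces, here is a small sketch that plugs ratings into the published coefficients. The coefficients are the ones quoted above; the function name and the two example pitches (all 5s and all 1s, assuming each attribute is rated from 1 to 5) are my own illustration.

```python
# Coefficients as quoted in the article's equation.
INTERCEPT = -0.355
COEFFICIENTS = {"team": 0.349, "market": 0.334, "traction": 0.222, "product": 0.029}

def predicted_likelihood(ratings):
    """Evaluate the published linear model for one pitch."""
    return INTERCEPT + sum(COEFFICIENTS[name] * ratings[name] for name in COEFFICIENTS)

# Two hypothetical pitches at the extremes of a 1-to-5 rating scale.
best = predicted_likelihood(dict.fromkeys(COEFFICIENTS, 5))
worst = predicted_likelihood(dict.fromkeys(COEFFICIENTS, 1))
print(f"All 5s: {best:.3f}")   # about 4.315
print(f"All 1s: {worst:.3f}")  # about 0.579
```

Note that the output is a score on roughly the same scale as the ratings, not a probability between 0 and 1, and that Product barely moves it.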

Sounds plausible?  Fits your notion that VCs invest in teams and not product? Is the fact that this is a regression analysis done by a VC partner enough to convince you to suspend your disbelief and accept this predictive model at face value? Or are you going to walk up to the stage and tell the magician that you are not satisfied with his lift shuffle and you are going to do it yourself?

Let us do the latter and while we are up on the stage let us ask the magician to roll up the sleeves as well.

How did he build the model? Mr. Jamison, the author, said he rated each pitch on the five dimensions on a scale of 1 to 5. He explains more on how he defined the ratings in his blog. Let us assume the ratings are on an interval scale so that we can run Multiple Linear Regression (OLS – Ordinary Least Squares).

Now, what are the problems with this predictive model?

  1. How reliable is the data? Mr. Jamison collected 200 startup pitches available to him (not random sampling, mind you) and gave the ratings ex post. That is, these are NOT the ratings his firm gave on these dimensions at the time of the pitch but ratings done by Mr. Jamison now, just for the purpose of this analysis.
    That is a biased sample with flawed measurement. You can stop right here and call him out. The rest of the article and his claims based on the regression analysis are pointless.
  2. How good is the model? A multiple regression model is judged by two metrics. One, R², which is the strength of the relation between the explanatory variables and the dependent variable, and two, a measure of whether each variable’s relation is statistically significant (p-value < 0.05).
    This model has an R² value of 0.5. This means 50% of the variation in Likelihood (the left-hand-side variable) can be explained by variation in these four variables. But is each explanatory variable’s relation statistically significant? Mr. Jamison does not give us the t-stat (or p-value) data. This is likely because he simply ran the regression with all the variables and reported just the R².
    If one were to use Excel’s simplistic DataAnalysis tool to run the multiple regression, that is what one would get. In essence, we do not know how many of the four variables really have any effect on the Likelihood of Receiving Term Sheet.
    The right way to do the regression is to enter variables one at a time and see if each one’s relation is statistically significant and if the R² value changes with the addition of the variable to the model. It is possible that only one of the variables is relevant and its R² could be much lower than 50%.
    So all the explanations on the importance of Team, Market and Traction that Mr. Jamison provides are irrelevant because they are based on faulty analysis.
  3. About the use of the term Likelihood: It is misleading, as I first thought he was really measuring Likelihood using Logistic regression. It is OLS where he models Likelihood on a 5 point scale. That rating is quite meaningless: the outcome is simply a binary variable, whether he extended a term sheet or not. In which case he should be running Logistic Regression, which measures the probability that a startup will get a term sheet given the values of the four explanatory variables.
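For readers who want to see what the logistic alternative in point 3 looks like, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the data is synthetic (generated from invented weights, not Mr. Jamison's pitches), ratings are centered by subtracting 3, and the fit uses simple gradient ascent rather than a statistics package.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict_probability(weights, row):
    """Probability of a term sheet for one row of centered ratings."""
    features = [1.0] + list(row)  # leading 1.0 pairs with the intercept
    return sigmoid(sum(w * x for w, x in zip(weights, features)))

def fit_logistic(rows, outcomes, learning_rate=0.5, epochs=500):
    """Fit [intercept, w1, ..., wk] by batch gradient ascent on the log-likelihood."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        gradient = [0.0] * len(weights)
        for row, outcome in zip(rows, outcomes):
            error = outcome - predict_probability(weights, row)
            for j, x in enumerate([1.0] + list(row)):
                gradient[j] += error * x
        weights = [w + learning_rate * g / len(rows) for w, g in zip(weights, gradient)]
    return weights

# Synthetic data: 200 pitches rated 1-5 on team, market, traction, product.
# TRUE_WEIGHTS are invented purely to generate example outcomes.
random.seed(0)
TRUE_WEIGHTS = [-1.0, 1.0, 0.8, 0.5, 0.1]
rows, outcomes = [], []
for _ in range(200):
    centered = [random.randint(1, 5) - 3 for _ in range(4)]
    rows.append(centered)
    got_term_sheet = random.random() < predict_probability(TRUE_WEIGHTS, centered)
    outcomes.append(1 if got_term_sheet else 0)

fitted = fit_logistic(rows, outcomes)
strong = predict_probability(fitted, [2, 2, 2, 2])      # rated 5 on everything
weak = predict_probability(fitted, [-2, -2, -2, -2])    # rated 1 on everything
print(f"P(term sheet | all 5s) = {strong:.3f}")
print(f"P(term sheet | all 1s) = {weak:.3f}")
```

Unlike the OLS equation, the output here is always a probability between 0 and 1. A real analysis would also report a standard error and p-value for each coefficient, which this sketch omits.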

Even if the model did not have any of these errors, there are still lurking variables. Regression is not causation despite the equation form. It is still correlation and there are many lurking variables including who introduced the startup for the pitch and whether the VCs identify themselves with the startup founders.

What this really means is VCs don’t have any real model for evaluating startups.  Consider this – if we took this raw data, stripped out the Likelihood variable and asked VCs (in general) to rate the likelihood, how different are these going to be from VC to VC and how different will these ratings be from one done based on coin-toss?

It would have been interesting if VCs had a scoring system for these four attributes and other dimensions, rated the startups as a team right after the pitch, and agreed to extend term sheets only to those that reached a certain threshold.

But what we have here  is faulty data and analysis used to color gut calls as quantitative.

Are you going to willingly suspend your disbelief? Or …

My First GitHub Commit – Select Random Tweets

I have seen several reports that collect and analyze millions of tweets and tease out a finding from them. The problem with these reports is that they do not start with any hypothesis; they find the very hypothesis they claim to be true by looking at large volumes of data. Just because we have Big Data does not mean we can suspend application of mind.

In the case of twitter analysis, only those with API skills had access to data. And they applied their API skills well, collecting every possible tweet to make predictions that are nothing more than statistical anomalies. Given millions of tweets, anything that looks interesting will catch the eye of a determined programmer seeking to sensationalize his findings.

I believe in random sampling. I believe in reasonable sample sizes, not a whale of a sample size. I believe that abundance of data does not obviate the need for theory or mind. I am not alone here. I posit that any relevant and actionable insight can only come from starting with a relevant hypothesis based on prior knowledge and then using random sampling to test the hypothesis. You do not need millions of tweets for hypothesis testing!

To make it easier for those seeking twitter data to test their hypotheses I am making available a really simple script to select tweets at random. You can find the code at GitHub. You can easily change it to do any type of randomization and search queries. For instance, if you want to select random tweets that mention Justin Bieber, you can do that.
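The GitHub script itself is not reproduced here. As a rough illustration of the idea only, the sketch below picks a fixed-size random sample from a stream of tweets using reservoir sampling, so the full set never has to be held in memory. The tweet texts, function names and keyword filter are all hypothetical stand-ins, and no Twitter API call is shown.

```python
import random

def reservoir_sample(items, sample_size, rng=random):
    """Uniform random sample of up to sample_size items from any iterable."""
    sample = []
    for index, item in enumerate(items):
        if index < sample_size:
            sample.append(item)
        else:
            # Keep the new item with probability sample_size / (index + 1).
            slot = rng.randint(0, index)
            if slot < sample_size:
                sample[slot] = item
    return sample

def matching(tweets, keyword):
    """Yield only the tweets that mention the keyword (case-insensitive)."""
    keyword = keyword.lower()
    return (tweet for tweet in tweets if keyword in tweet.lower())

# Hypothetical stand-in for tweets fetched elsewhere.
tweets = [
    "Justin Bieber tickets on sale today",
    "Running an AB test on link colors",
    "New Justin Bieber single is out",
    "Random sampling beats big data",
    "Is justin bieber touring this year?",
    "Preschool enrollment numbers released",
    "Saw Justin Bieber at the airport",
]

random.seed(42)
sample = reservoir_sample(matching(tweets, "justin bieber"), sample_size=3)
print(sample)
```

Every matching tweet has the same chance of ending up in the sample, which is the property hypothesis testing needs.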

The script has bugs? It likely does. Surely others can pitch in to fix it.

Relevance not Abundance!

Small samples and tests for statistical significance rather than all the data and statistical anomalies.

Does preschool lead to career success?

If you are reading this article it is highly likely your child has been in preschool or will attend preschool. But pick any child at random from the US population and you will find there is only a 50% chance the child goes to preschool.

The rest either stay home, where they play with parents or caregivers, or attend daycare, which may not have an educational component. Preschool isn’t mandatory, and in most places it’s not free. (Source : WSJ)

What is the observed difference in later performance between those who attended preschool and those who didn’t?

According to Dr. Celia Ayala, research says preschool attendance points to a stellar career. She said,

“Those who go to preschool will go on to university, will have a graduate education, and their income level will radically improve,”

50% of children don’t get to attend preschool because of economic disparity. It seems only fair to democratize the opportunities for these children and provide them free preschool when their parents can’t afford it.

I do not have a stand on fairness but I have a position on the reported research and how it drew such a causal conclusion.

First, I cannot judge research when someone simply says “research says” without producing the work, the data that went into it and the publication. Let us look at two possible ways the said research could have been conducted.

Cross-sectional Analysis – Grab a random sample of successful and unsuccessful adults and see if there is a statistically significant difference in the number of those who attended preschool. As a smart and analytically minded reader you can see the problem with cross-sectional studies. They cannot account for all the different factors and they confuse correlation with causation.

Longitudinal Analysis – This means studying over a period of time. Start with some preschoolers and some children not in preschool and track their progress through education, college and career. If there is a statistically significant difference then you could say preschool helped. But you, the savvy reader, can see the same problems persist. Most significantly, it ignores the effect of parents, both their financial status and genes.
A parent who enrolls the child in preschool is more likely to be involved in every step of the child’s growth. Even if you discount that, the child is simply lucky to start with smart parents.

So the research in essence is not actionable. Using it to divert resources into providing preschool to those who cannot afford it is not only misguided but also overlooks the opportunity cost of the capital.

What if the resources could actually help shore up elementary, middle or high schools in low-income neighborhoods? Or provide supplementary classes to those who are falling behind?

Failing to question the research, neglecting opportunity costs and blindly shoveling resources on moving a single metric will only result in moving the metric but with no tangible results.

Where do you stand?

Demand Validation – Don’t stop with what is easy, available and fits your hypothesis

As days get hotter, customers line up at Good Humor ice cream trucks only to be disappointed to find that their favorite ice cream, Toasted Almond Bar, is no longer available. Truck after truck, customer after customer, similar story. Customers cannot believe the truck no longer carries their favorite product. (Full story here)

What is wrong with the business that does not know its own customers and their needs?

Why are they refusing to heed the validation they get from the ice cream trucks (their distribution channel) who are outside the corporate building and with the customers?

This is not because Unilever, which owns the Good Humor brand, is not customer centric but because it is looking at aggregate customer demand, not just a handful of customer inputs. These anecdotes about disappointed customers are just that, anecdotes, and they do not provide demand validation.

One, two,…, a hundred people walking up and demanding a product is not enough. When Unilever looks at its flavor mix, the hero of this story is actually the least popular, bringing in only 3% of the sales. Its data shows that the almond bar is popular only in the Northeast, especially among grown-ups (see footnote on segmentation).

Talking to a handful of grownups from the Northeast, just because these were the only ones available (like talking to a few people in Coupa cafe in Palo Alto), is not demand validation. These anecdotes can only help you frame better hypotheses about customer needs; they are not proof of the hypothesis itself.

Even if you were to pick 100 grownups from the Northeast (a good enough sample size to give an answer at 95% confidence with a 10% margin of error), you are going to end up with the wrong answer about your customers. (Because you are not doing random sampling from your entire target segment.)
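The "100 is good enough" figure can be checked with the usual sample-size formula for estimating a proportion, n = z² p(1 − p) / e². The sketch below assumes the worst-case proportion p = 0.5 and z = 1.96 for 95% confidence; the function name is my own.

```python
import math

def required_sample_size(margin_of_error, z_score=1.96, proportion=0.5):
    """Smallest n for the given margin of error, assuming simple random sampling."""
    n = z_score ** 2 * proportion * (1 - proportion) / margin_of_error ** 2
    return math.ceil(n)

n_10_percent = required_sample_size(margin_of_error=0.10)
n_5_percent = required_sample_size(margin_of_error=0.05)
print(f"10% margin of error: {n_10_percent} respondents")  # 97
print(f" 5% margin of error: {n_5_percent} respondents")   # 385
```

The formula only holds when respondents are drawn at random from the whole target segment, so a convenience sample of 100 Northeast grownups does not earn that margin of error.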

When it comes to demand validation do get out of the building. But when you return don’t go building almond bars because a few grownups in your Northeast neighborhood (or others at a boot-camp) said so. You have some serious market analysis work to do.


Note on Segmentation: ‘Grownups in Northeast’ is not a segment. This is a measure of their customer mix. We still do not know why these people love this specific flavor.

It is likely better to speak in absolutes

You read only interesting findings because only those get published, get written about and popularized in social media. Experiments that find no statistically significant difference don’t leave the filing cabinets of researchers because no one wants to read a story where nothing happens. This is such an experiment, where there was not enough evidence to reject the null hypothesis.

Let us start at the beginning. This experiment is about people’s perception of a person’s competence based on whether the person speaks in absolutes with no room for alternatives or whether the person speaks in terms of likelihood, accounting for alternative explanations.

There are several examples of those who speak in absolutes with no self-doubt. Read any CEO interview (enterprise or startup), management guru’s book or Seth Godin’s blog. Examples are,

“Revenue grew because of our marketing”
“Sales fell because of Europe”
“Groupon works, it really works”

An example of speaking in terms of likelihood comes from Nobel laureates in economics,

“Answers to questions like that require careful thinking and a lot of data analysis. The answers are not likely to be simple.”

Hypotheses: You do start with hypotheses before any data analysis don’t you?

Here are the hypotheses I had about speaking in absolutes/likelihoods and perception of competence.

H1: Business leaders are judged to be more competent when they speak in absolutes. Conversely, using terms like “likely” may be perceived as wishy-washy and hence signal incompetence.

H2: Scientists are judged to be more competent when they use likelihoods and avoid absolutes. (Because scientists are expected to think about all aspects, and anyone who zeroes in on one factor must not know how to think about scenarios.)

Of course the null hypothesis is there is no statistically significant difference in perception of competence based on whether the subject in question speaks in absolutes or in likelihoods.

Experiment Design: So I designed a simple 2X2 experiment, using SurveyGizmo. You can see the four groups, Company Executive and Scientist as one dimension, Absolutes and Likelihoods on the other. I designed a set of 4 statements with these combinations. When people clicked on the survey they were randomly shown one of the four options.

Here is one of the four statements

This was a very generic statement meant to speak about results and what could have caused it. I avoided specific statements because people’s domain knowledge and preconceived notions come into play. For example, if I had used a statement about lean startup or social media it would have resulted in significant bias in people’s answers.

Based on just one statement, without context, people were asked to rate the competence of the person. Some saw this about Scientists, some about a Company Executive.

Note that an alternate design is to show both Absolute and Likelihood statement and ask the respondents to pick the one they believe to be more competent. I believe that would lead to experimental bias as people may start to interpret the difference between two statements.

Results: I collected 130 responses, almost evenly split between the four groups, and ran t-tests on the mean rating between the groups (Scientists: Absolute/Likelihood, Executive: Absolute/Likelihood, Absolute: Executive/Scientist, Likelihood: Executive/Scientist). And you likely guessed the results from my opening statements.

There is not enough evidence to reject the null hypothesis in any of the tests. That means any difference we see in competence perception between those speaking in absolutes and those speaking in likelihoods could just be random.
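For anyone who wants to repeat this kind of comparison, here is a minimal sketch of a two-sample test on mean ratings. The post does not say which t-test variant was used, so Welch's version is my assumption; the ratings below are invented, not the survey data; and the p-value uses a normal approximation, which is a simplification for groups this small.

```python
import math
import statistics

def welch_t_test(group_a, group_b):
    """Welch's t statistic, its degrees of freedom, and an approximate p-value."""
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    # Squared standard error of each group mean (sample variance / n).
    se_sq_a = statistics.variance(group_a) / len(group_a)
    se_sq_b = statistics.variance(group_b) / len(group_b)
    t_stat = (mean_a - mean_b) / math.sqrt(se_sq_a + se_sq_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se_sq_a + se_sq_b) ** 2 / (
        se_sq_a ** 2 / (len(group_a) - 1) + se_sq_b ** 2 / (len(group_b) - 1)
    )
    # Two-sided p-value via the normal approximation (not the exact t distribution).
    p_value = math.erfc(abs(t_stat) / math.sqrt(2))
    return t_stat, df, p_value

# Invented competence ratings on a 1-to-5 scale, for illustration only.
absolutes = [4, 3, 5, 3, 4, 2, 4, 3, 5, 3, 4, 3, 2, 4, 3, 4]
likelihoods = [3, 4, 4, 3, 5, 3, 2, 4, 3, 4, 3, 5, 3, 4, 2, 3]

t_stat, df, p_value = welch_t_test(absolutes, likelihoods)
print(f"t = {t_stat:.2f}, df = {df:.1f}, approx p-value = {p_value:.2f}")
print("Reject the null hypothesis" if p_value < 0.05 else "Not enough evidence to reject")
```

For an exact p-value, a statistics library such as SciPy (`scipy.stats.ttest_ind` with `equal_var=False`) uses the t distribution with these degrees of freedom.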

What does this mean to you?

Speaking in absolutes, a trait that leaders cultivate to be seen as competent and decisive, showed no positive effect in this experiment. Including uncertainties does not hurt either.

So go right ahead and present simplistic one-size-fits-all solutions without self-doubt. After all, stopping to think about alternatives and uncertainties only takes time and hurts one’s brain, with no positive effect on the audience.

Caveats: While competence is not an issue I believe trust perception could be different. That requires another experiment.