Iterative Path

Marketing Strategy and Pricing


Statistical Significance

Dim Light and Ambient Noise Increase Creativity … Not among skeptics

Here are two studies that got lots of social media mileage this week

  1. Dimming lights enhances creativity
  2. Right amount of ambient noise increases creativity

Quoting from the light study, Salon writes

during that all-important phase when you’re grasping for ideas, dim light appears to be a catalyst for creativity. “Darkness changes a room’s visual message,” the researchers explain, and one such message seems to be it’s safe to explore widely and let your imagination run free.

And Salon goes on to make a general prescription for all of us:

So if you’re struggling to finish that screenplay or come up with the next must-have app, you might try illuminating your workspace with one bare bulb of minimal wattage.

Quoting from the ambient noise study, 99u writes

 moderate level of background noise creates just enough distraction to break people out of their patterns of thinking and nudge them to let their imagination wander, while still keeping them from losing their focus on the project all together. This distracted focus helps enhance your creativity.

Their recommendation is to roam about in hotel lobbies and coffee shops. If you cannot roam about, they have apps for that (with names like Coffitivity).

Before you give these studies a second look, and despite the fact that they are published in peer-reviewed journals and found statistically significant differences, stop and ask some critical questions.

  1. Are the effects additive? Would it work better if we roamed about in dim coffee shops?
  2. The ambient noise study reports a difference between 50dB and 70dB. How many other pairs did they test and reject before reporting a pair that showed a statistically significant difference? (See green jelly beans cause acne.) Only the sound levels that showed a difference get reported; the rest get filed in the round filing cabinet. And journals prefer to publish only studies that find a statistically significant difference. Remember that when motivated researchers keep looking for something interesting, they are bound to find it.
  3. What is the measure of creativity? Is it correct and relevant? The ambient noise study used the Remote Associates Test while the dim light study used the Creative Insights Problem. Why two different metrics in the two studies? Did they try all kinds of measures and pick the one that showed a statistically significant difference? If you repeated the Dim Light experiment with the Remote Associates Test and the Ambient Noise experiment with the Creative Insights Problem, would the results hold?
  4. Let us say all these are answered in favor of the studies. Does that mean the results translate to the real world, which has far too many variables? How many hidden hypotheses did the two research teams take for granted in coming up with their results? How many of those will be violated in the real world?
  5. Does statistical significance mean economic significance? What is the economic impact of any perceived difference?
  6. Do you have means to measure the creativity of your team that are based on real-life results and do not involve administering the Remote Associates Test or the Creative Insights Problem? Performance in tests like these is rarely an indication of actual job performance, as Google found out about brainteasers in job interviews.
  7. Don’t forget opportunity cost. You see here two recipes for improving creativity; you will find more if you look for them. All peer reviewed, I bet. Which one can you afford to pick? Or could you be investing your time and resources in something else instead of dimming lights and creating coffee shop noise?
  8. Even if you go on to dim lights or create coffee shop noise, there is always the Hawthorne effect. Either your team will react to the change and tell you they are seeing improvement, or you will convince yourself you are seeing improvement because you do not want to be wrong about your decision to pipe coffee shop noise through office speakers.
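To put a number on question 2, here is a minimal sketch of how quickly false positives pile up when many comparisons are tried. It assumes the tests are independent and each uses a 0.05 significance threshold; neither study says how many pairs were actually tried, so the counts below are purely illustrative and `familywise_error_rate` is a name made up for this sketch.

```python
def familywise_error_rate(num_tests, alpha=0.05):
    """Chance of at least one false positive across independent tests.

    Assumes no real effect exists in any test and that the tests are
    independent, which real experiments rarely guarantee.
    """
    return 1 - (1 - alpha) ** num_tests


# Illustrative counts only; not taken from either study.
for k in (1, 5, 20, 40):
    rate = familywise_error_rate(k)
    print(f"{k:>2} tests: {rate:.0%} chance of at least one false positive")
```

Under these assumptions, 20 comparisons already give roughly a 64% chance that at least one looks "significant" by luck alone.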

Finally it behooves us to keep in mind the great academic fraud of Diederik Stapel.  I am not saying anything remotely close to what Stapel did happened in these two studies. But I am urging you to start with high skepticism and place the further burden of proof of applications on the researchers and on those writing derivative works based on the research.

Your opportunity cost of internalizing these studies and acting on them is far too high. You don’t want to bump into other Salon readers wandering in dim and noisy hotel lobbies.

If you want an excuse to get your work done in a coffee shop, do it. But do not try to justify it with scientific reason.

Finally, if you were allowed to believe in only one of the two studies, which one would you believe?

Testing 40 shades of blue – AB Testing

The title refers to the famous anecdote about Marissa Mayer testing 40 shades of blue to determine the right color for the links. (Unfortunately I am colorblind, I know just one blue.)

Mayer is famous for many things at Google, but the one that always sticks out – and defines her in some ways – is the “Forty Shades of Blue” episode.

she ordered that 40 different shades of blue would be randomly shown to each 2.5% of visitors; Google would note which colour earned more clicks. And that was how the blue colour you see in Google Mail and on the Google page was chosen.

Thousands of such tests happen in the web world, with websites running multiple experiments a day. Contrary to what many in webapp development may believe, AB testing does not have its origins in the webapp world. It is simply an application of statistical testing, the Randomized Controlled Trial, to determine whether a ‘treatment’ made a difference in the performance of the treatment group compared to the control group.

The simplest test checks whether the observed difference between the two sample means is statistically significant. That means computing the p-value: the probability of seeing a difference at least this large if the treatment made no real difference and only random variation were at work. If the p-value is less than a preset level, we declare that the treatment made a difference.
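As a rough sketch of what such a test looks like for click-through data, here is a two-sided z-test comparing two conversion rates (a conversion rate is just the mean of 0/1 outcomes). The visitor and click counts are invented for illustration, the function name is hypothetical, and the normal approximation assumes reasonably large samples.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both groups share one rate.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided tail probability of a standard normal beyond |z|.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up numbers: control converts 200 of 10,000, treatment 250 of 10,000.
z, p = two_proportion_z_test(200, 10_000, 250, 10_000)
print(f"z = {z:.2f}, p-value = {p:.4f}")
```

Whether a difference that clears the threshold is worth anything to the business is a separate question the test cannot answer.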

Does it matter if the results are statistically significant? See here why it does not:

“I have published about 800 papers in peer-reviewed journals and every single one of them stands and falls with the p-value. And now here I find a p-value of 0.0001, and this is, to my way of thinking, a completely nonsensical relation.”

Should you test 40 shades of blue to find the one that produces most click-thrus or conversions? xkcd has the answer:

Can Ms. Mayer test her way out of Yahoo’s current condition? Remember that all this split testing is about finding low-hanging fruit, not quantum leaps. As Jim Manzi wrote in his book Uncontrolled,

Perhaps the single most important lesson I learned in commercial experimentation, and that I have since seen reinforced in one social science discipline after another, is that there is no magic. I mean this in a couple of senses. First, we are unlikely to discover some social intervention that is the moral equivalent of polio vaccine. There are probably very few such silver bullets out there to be found. And second, experimental science in these fields creates only marginal improvements. A failing company with a poor strategy cannot blindly experiment its way to success …

You can’t make up for poor strategy with incessant experimentation.

Making up things and supporting with faulty analysis

Update 7/11/202: I took a harsh stance against Mr. Jamison’s article. As I communicate with him over email and see his willingness to share data and refine his model, I see my comments as a little harsh. Instead of updating them I will leave them as they are so you can also judge my writing.

How do VCs decide to pass on a startup? If you read a TechCrunch article, you will find a quantitative model supported by statistical analysis:

Likelihood of Receiving Term Sheet = -0.355  +
0.349 (Team) +
0.334 (Market) +
0.222 (Traction) +
0.029 (Product)

A nice linear regression model with an R² value of 0.5 that states the Likelihood of getting a Term Sheet as a function of four attributes. The article and the regression model come from a partner in a VC firm, Mr. Jay Jamison.
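To see what the equation actually predicts, here is a small sketch that plugs ratings into the published coefficients. It assumes each attribute is rated from 1 to 5, as the author describes; the function and variable names are mine, not his.

```python
# Coefficients copied from the published equation.
INTERCEPT = -0.355
COEFFICIENTS = {"team": 0.349, "market": 0.334, "traction": 0.222, "product": 0.029}


def predicted_likelihood(ratings):
    """Apply the published linear equation to one pitch's ratings."""
    return INTERCEPT + sum(COEFFICIENTS[name] * ratings[name] for name in COEFFICIENTS)


# Assumed 1-to-5 rating scale for every attribute.
all_fives = predicted_likelihood({"team": 5, "market": 5, "traction": 5, "product": 5})
all_ones = predicted_likelihood({"team": 1, "market": 1, "traction": 1, "product": 1})
weak_product = predicted_likelihood({"team": 5, "market": 5, "traction": 5, "product": 1})

print(f"all fives: {all_fives:.3f}")
print(f"all ones: {all_ones:.3f}")
print(f"fives with product rated 1: {weak_product:.3f}")
```

Taken at face value, moving Product from 1 to 5 shifts the prediction by only about 0.12, which is where the "teams, not products" reading comes from.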

Sounds plausible?  Fits your notion that VCs invest in teams and not product? Is the fact that this is a regression analysis done by a VC partner enough to convince you to suspend your disbelief and accept this predictive model at face value? Or are you going to walk up to the stage and tell the magician that you are not satisfied with his lift shuffle and you are going to do it yourself?

Let us do the latter and while we are up on the stage let us ask the magician to roll up the sleeves as well.

How did he build the model? Mr. Jamison, the author, said he rated each pitch on the five dimensions on a scale of 1 to 5. He explains more about how he defined the ratings in his blog. Let us assume the ratings are on an interval scale, which Multiple Linear Regression (OLS – Ordinary Least Squares) requires.

Now, what are the problems with this predictive model?

  1. How reliable is the data? Mr. Jamison collected 200 startup pitches available to him (not random sampling, mind you) and gave the ratings ex post. That is, these are NOT the ratings his firm gave on these dimensions at the time of the pitch; Mr. Jamison assigned them now, just for the purpose of this analysis.
    That is a biased sample with flawed measurement. You can stop right here and call him out. The rest of the article and his claims based on the regression analysis are pointless.
  2. How good is the model? A multiple regression model is judged by two metrics. One is R², the strength of the relation between the explanatory variables and the dependent variable. The other is whether each variable’s relation is statistically significant (p-value < 0.05).
    This model has an R² value of 0.5. This means 50% of the variation in Likelihood (the dependent variable, on the left-hand side of the equation) can be explained by variation in these four variables. But is each explanatory variable’s relation statistically significant? Mr. Jamison does not give us the t-stat (or p-value) data. This is likely because he simply ran the regression with all the variables and reported just the R².
    If one were to use Excel’s simple Data Analysis tool to run multiple regression, that is what one would get. In essence, we do not know how many of the four variables really have any effect on the Likelihood of Receiving Term Sheet.
    The right way to do the regression is to enter variables one at a time and see whether each one’s relation is statistically significant and whether the R² value changes when the variable is added to the model. It is possible that only one of the variables is relevant and its R² could be much lower than 50%.
    So all the explanations of the importance of Team, Market and Traction that Mr. Jamison provides are irrelevant because they are based on faulty analysis.
  3. About the use of the term Likelihood: it is misleading, as I first thought he was really measuring likelihood using Logistic Regression. It is OLS, where he models Likelihood on a 5-point scale. That rating is quite meaningless: the real outcome is simply a binary variable, whether he extended a term sheet or not. In that case he should be running Logistic Regression, which estimates the probability that a startup will get a term sheet given the values of the four explanatory variables.
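For readers unfamiliar with the form a logistic model takes, here is a minimal sketch. The intercept and weights are hypothetical placeholders invented for this illustration; they are not fitted to Mr. Jamison’s data, which would be needed to estimate real ones.

```python
import math


def term_sheet_probability(intercept, weights, ratings):
    """Logistic model: squash a linear score into a probability between 0 and 1."""
    score = intercept + sum(weights[name] * ratings[name] for name in weights)
    return 1.0 / (1.0 + math.exp(-score))


# HYPOTHETICAL parameters, chosen only to show the shape of the model.
HYPOTHETICAL_INTERCEPT = -6.0
HYPOTHETICAL_WEIGHTS = {"team": 0.8, "market": 0.6, "traction": 0.4, "product": 0.1}

strong = term_sheet_probability(
    HYPOTHETICAL_INTERCEPT, HYPOTHETICAL_WEIGHTS,
    {"team": 5, "market": 5, "traction": 5, "product": 5},
)
weak = term_sheet_probability(
    HYPOTHETICAL_INTERCEPT, HYPOTHETICAL_WEIGHTS,
    {"team": 1, "market": 1, "traction": 1, "product": 1},
)
print(f"strong pitch: {strong:.2f}, weak pitch: {weak:.2f}")
```

Unlike the OLS equation, the output is always a probability between 0 and 1, which matches a yes-or-no term sheet outcome.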

Even if the model did not have any of these errors, there are still lurking variables. Regression is not causation despite the equation form. It is still correlation and there are many lurking variables including who introduced the startup for the pitch and whether the VCs identify themselves with the startup founders.

What this really means is VCs don’t have any real model for evaluating startups.  Consider this – if we took this raw data, stripped out the Likelihood variable and asked VCs (in general) to rate the likelihood, how different are these going to be from VC to VC and how different will these ratings be from one done based on coin-toss?

It would have been interesting if VCs had a scoring system for these four attributes and other dimensions, rated the startups as a team right after the pitch, and agreed to extend term sheets only to those that reached a certain threshold.

But what we have here  is faulty data and analysis used to color gut calls as quantitative.

Are you going to willingly suspend your disbelief? Or …

My First GitHub Commit – Select Random Tweets

I have seen several reports that collect and analyze millions of tweets and tease out a finding from it. The problem with these reports is they do not start with any hypothesis and find the very hypothesis they are claiming to be true by looking at large volumes of data. Just because we have  Big Data, it does not mean we can suspend application of mind.

In the case of Twitter analysis, only those with API skills had access to data. And they applied their API skills well, collecting every possible tweet to make predictions that are nothing more than statistical anomalies. Given millions of tweets, anything that looks interesting will catch the eye of a determined programmer seeking to sensationalize his findings.

I believe in random sampling. I believe in reasonable sample sizes, not a whale of a sample size. I believe that abundance of data does not obviate the need for theory or mind. I am not alone here. I posit that any relevant and actionable insight can only come from starting with a relevant hypothesis based on prior knowledge and then using random sampling to test the hypothesis. You do not need millions of tweets for hypothesis testing!

To make it easier for those seeking Twitter data to test their hypotheses, I am making available a really simple script to select tweets at random. You can find the code at GitHub. You can easily change it to do any type of randomization and search queries. For instance, if you want to select random tweets that mention Justin Bieber, you can do that.

Does the script have bugs? It likely does. Surely others can pitch in to fix it.
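The GitHub script itself is not reproduced here. As a stand-in, the following is a minimal sketch of the sampling idea only, assuming the tweets have already been fetched into a list of strings; it does not call the Twitter API, and the function name and sample data are made up for illustration.

```python
import random


def sample_tweets(tweets, k, keyword=None, seed=None):
    """Return up to k tweets drawn uniformly at random.

    If keyword is given, only tweets containing it (case-insensitive)
    are eligible for the draw.
    """
    if keyword is not None:
        eligible = [t for t in tweets if keyword.lower() in t.lower()]
    else:
        eligible = list(tweets)
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return rng.sample(eligible, min(k, len(eligible)))


# Made-up tweets standing in for data collected elsewhere.
collected = [
    f"tweet {i} about Justin Bieber" if i % 3 == 0 else f"tweet {i} about pricing"
    for i in range(300)
]
picked = sample_tweets(collected, k=10, keyword="justin bieber", seed=42)
print(len(picked), "tweets sampled")
```

A few hundred randomly drawn tweets, chosen to test a hypothesis stated up front, is the point; the sample size `k` is yours to set.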

Relevance not Abundance!

Small samples and tests for statistical significance rather than all the data and statistical anomalies.

Does preschool lead to career success?

If you are reading this article, it is highly likely your child has been in preschool or will attend preschool. But pick any child at random from the US population and there is only a 50% chance that the child goes to preschool.

The rest either stay home, where they play with parents or caregivers, or attend daycare, which may not have an educational component. Preschool isn’t mandatory, and in most places it’s not free. (Source : WSJ)

What is the observed difference in later performance between those who attended preschool and those who didn’t?

According to Dr. Celia Ayala, research says preschool attendance points to a stellar career. She said,

“Those who go to preschool will go on to university, will have a graduate education, and their income level will radically improve,”

50% of children don’t get to attend preschool because of economic disparity. It seems only fair to democratize the opportunities for these children and provide them free preschool when their parents can’t afford it.

I do not have a stand on fairness but I have a position on the reported research and how they drew such a causation conclusion.

First, I cannot judge research when someone simply says “research says” without producing the work, the data that went into it and the publication. Let us look at two possible ways the said research could have been conducted.

Cross-sectional Analysis – Grab a random sample of successful and unsuccessful adults and see if there is a statistically significant difference in the number of those who attended preschool. As a smart and analytically minded reader, you can see the problem with cross-sectional studies. They cannot account for all the different factors, and they confuse correlation with causation.

Longitudinal Analysis – This means studying over a period of time. Start with some children in preschool and some not, and track their progress through education, college and career. If there is a statistically significant difference then you could say preschool helped. But you, the savvy reader, can see the same problems persist. Most significantly, it ignores the effect of parents, both their financial status and their genes.
A parent who enrolls the child in preschool is more likely to be involved in every step of the child’s growth. Even if you discount that, the child is simply lucky to start with smart parents.

So the research, in essence, is not actionable. Using it to divert resources into providing preschool to those who cannot afford it is not only misguided but also overlooks the opportunity cost of the capital.

What if the resources could instead help shore up elementary, middle or high schools in low-income neighborhoods? Or provide supplementary classes to those who are falling behind?

Failing to question the research, neglecting opportunity costs and blindly shoveling resources on moving a single metric will only result in moving the metric but with no tangible results.

Where do you stand?

Demand Validation – Don’t stop with what is easy, available and fits your hypothesis

As days get hotter, customers line up at Good Humor ice cream trucks only to be disappointed to find that their favorite ice cream, the Toasted Almond Bar, is no longer available. Truck after truck, customer after customer, it is a similar story. Customers cannot believe the truck no longer carries their favorite product. (Full story here)

What is wrong with the business that does not know its own customers and their needs?

Why are they refusing to heed the validation they get from the ice cream trucks (their distribution channel) who are outside the corporate building and with the customers?

This is not because Unilever, which owns the Good Humor brand, is not customer centric, but because it is looking at aggregate customer demand, not just a handful of customer inputs. These anecdotes about disappointed customers are just that, anecdotes, and do not provide demand validation.

One, two,…, a hundred people walking up and demanding a product is not enough. When Unilever looks at its flavor mix, the hero of this story is actually the least popular, bringing in only 3% of the sales. Their data shows that the almond bar is popular only in the Northeast, especially among grown-ups (see footnote on segmentation).

Talking to a handful of grownups from the Northeast, just because these were the only ones available (like talking to a few people in Coupa cafe in Palo Alto), is not demand validation. These anecdotes can only help you frame a better hypothesis about customer needs; they are not proof of the hypothesis itself.

Even if you were to pick 100 grownups from the Northeast (a sample size good enough to give a 95% confidence answer at about a 10% margin of error), you are going to end up with the wrong answer about your customers, because you are not doing random sampling from your entire target segment.
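The sample-size figure can be checked with the standard margin-of-error formula for a proportion. This sketch assumes simple random sampling, the worst-case proportion of 0.5 and the usual 1.96 multiplier for 95% confidence; the function name is mine.

```python
import math


def margin_of_error(sample_size, z=1.96, proportion=0.5):
    """Margin of error for an estimated proportion under simple random sampling."""
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)


# 100 respondents, 95% confidence, worst-case proportion of 0.5.
moe = margin_of_error(100)
print(f"margin of error for n=100: {moe:.1%}")
```

The formula only holds when the 100 people are drawn at random from the whole target segment, which is exactly what a convenience sample of Northeast grownups fails to do.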

When it comes to demand validation do get out of the building. But when you return don’t go building almond bars because a few grownups in your Northeast neighborhood (or others at a boot-camp ) said so. You have some serious market analysis work to do.

Note on Segmentation: ‘Grownups in Northeast’ is not a segment. This is a measure of their customer mix. We still do not know why these people love this specific flavor.
