Dim Light and Ambient Noise Increase Creativity … Not among skeptics

Here are two studies that got lots of social media mileage this week

  1. Dimming lights enhances creativity
  2. Right amount of ambient noise increases creativity

Quoting from the light study, Salon writes

during that all-important phase when you’re grasping for ideas, dim light appears to be a catalyst for creativity. “Darkness changes a room’s visual message,” the researchers explain, and one such message seems to be it’s safe to explore widely and let your imagination run free.

And Salon goes on to make general prescription for all of us

So if you’re struggling to finish that screenplay or come up with the next must-have app, you might try illuminating your workspace with one bare bulb of minimal wattage.s

Quoting from the ambient noise study,  99u writes

 moderate level of background noise creates just enough distraction to break people out of their patterns of thinking and nudge them to let their imagination wander, while still keeping them from losing their focus on the project all together. This distracted focus helps enhance your creativity.

And their recommendations are to roam about in hotel lobbies and coffee shop. If you cannot roam about they have Apps for that (with names like Coffitivity)

Before you give a second look to these studies, despite the fact that these are published in peer reviewed journals and found statistically significant difference, stop and ask some critical questions.

  1. Are the effects additive? So would it work better if we roamed about in dim coffee shops?
  2. The ambient noise study reports difference between 50dB and 70dB. How many other pairs did they test and reject before reporting on a pair that had statistically significant difference? (See green jelly beans cause acne). Only those sound levels that showed difference get reported and the rest get filed in the round filing cabinet. And journals like publishing only those studies that find statistically significant difference. Remember that when motivated researchers keep looking for something interesting they are bound to find it.
  3. What is the measure of creativity? Is that correct and relevant? The ambient noise study used Remote Associates Test while the dim light study used Creative Insights Problem. Why two different metrics by the two studies? Did they try all kinds of measures and picked the one that showed statistically significant difference? If you repeated the Dim Light experiment with Remote Associates Test and Ambient Noise experiment with Creative Insights Problem will the results hold?
  4. Let us say all these are answered in favor of the tests. Does that mean the results translate into real world that has far too many variables? How many hidden hypothesis did the two researchers took for granted in coming up with their results? How many of those will be violated in the real world?
  5. Does statistical significance mean economic significance? What is the economic impact of any perceived difference?
  6. Do you have means to measure creativity of your team that are based on real life results and  do not involve administering Remote Associates Test or Creative Insights Problem? Performance in tests like these is rarely an indication of actual job performance, as Google found out about brainteasers in job interviews.
  7. Don’t forget opportunity cost. You see here two recipes for improving creativity, you will find more if you look for them. All peer reviewed, I bet. Which one can you afford to pick? Or could you be investing your time and resources in something else instead of dimming lights and creating coffee shop noise?
  8. Even if you go on to dim lights or create coffee shop noise there is always Hawthorne effect.  Either your team will react to it, tell you that they are seeing improvement or you will convince yourself you are seeing improvement because you do not want to be wrong about your decision to pipe coffee shop noise through office speakers.

Finally it behooves us to keep in mind the great academic fraud of Diederik Stapel.  I am not saying anything remotely close to what Stapel did happened in these two studies. But I am urging you to start with high skepticism and place the further burden of proof of applications on the researchers and on those writing derivative works based on the research.

Your opportunity cost of internalizing these studies and acting on them is far too high. You don’t want to bump into other Salon readers wandering in dim and noisy hotel lobbies.

If you want an excuse to get your work done in a coffee shop, do it. But do not try to justify it with scientific reason.

Finally, if you are allowed to believe in only one of the two studies which one will you believe?

Testing 40 shades of blue – AB Testing

The title refers to the famous anecdote about Marissa Mayer testing 40 shades of blue to determine the right color for the links. (Unfortunately I am colorblind, I know just one blue.)

Mayer is famous for many things at Google, but the one that always sticks out – and defines her in some ways – is the “Forty Shades of Blue” episode.

she ordered that 40 different shades of blue would be randomly shown to each 2.5% of visitors; Google would note which colour earned more clicks. And that was how the blue colour you see in Google Mail and on the Google page was chosen.

Thousands of such tests happen in the web world, every website running multiple experiments in a day. Contrary to what most in webapp development may believe AB testing does not have its origins in webapp world. It is simply an application of statistical testing, Randomized Control Trial, to determine if a ‘treatment’ made a difference on the performance of treatment group compared to performance of control group.

The simplest test is testing if the observed difference between the two sample means are statistically significant. What that means is measuring the probability, p-value, the difference is just random. If p-value is less than a preset level we declare the treatment made a difference.

Does it matter if the results are statistically significant? See here why it does not:

“I have published about 800 papers in peer-reviewed journals and every single one of them stands and falls with the p-value. And now here I find a p-value of 0.0001, and this is, to my way of thinking, a completely nonsensical relation.”

Should you test 40 shades of blue to find the one that produces most click-thrus or conversions? xkcd has the answer:

Can Ms. Mayer test the way out of Yahoo’s current condition? Remember all these split testing are about finding lower hanging fruits not quantum leaps. And as Jim Manzi wrote in his book Uncontrolled,

Perhaps the single most important lesson I learned in commercial experimentation, and that I have since seen reinforced in one social science discipline after another, is that there is no magic. I mean this in a couple of senses. First, we are unlikely to discover some social intervention that is the moral equivalent of polio vaccine. There are probably very few such silver bullets out there to be found. And second, experimental science in these fields creates only marginal improvements. A failing company with a poor strategy cannot blindly experiment its way to success …

You can’t make up for poor strategy with incessant experimentation.

My First GitHub Commit – Select Random Tweets

I have seen several reports that collect and analyze millions of tweets and tease out a finding from it. The problem with these reports is they do not start with any hypothesis and find the very hypothesis they are claiming to be true by looking at large volumes of data. Just because we have  Big Data, it does not mean we can suspend application of mind.

In the case of twitter analysis, only those with API skills had access to data. And they applied well their API skills to collect every possible tweet to make their prediction that are nothing more than statistical anomalies. Given millions of tweets, anything that looks interesting will catch the eye of a determined programmer seeking to sensationalize his findings.

I believe in random sampling. I believe in reasonable sample sizes, not whale of a sample size. I believe that abundance of data does not obviate the need for theory or mind. I am not alone here. I posit that the any relevant and actionable insight can only come from starting with relevant hypothesis based on prior knowledge and then using random sampling to test the hypothesis. You do not need millions of tweets for hypothesis testing!

To make it easier for those seeking twitter data to test their hypotheses I am making available a real simply script to select tweets at random. You can find the code at GitHub. You can easily change it to do any type of randomization and search queries. For instance you want to select random tweets that mention Justin Bieber, you can do that.

The script has bugs? I likely does. Surely others can pitch in to fix it.

Relevance not Abundance!

Small samples and test for statistical significance than all the data and statistical anomalies.

Does preschool lead to career success?

If you are reading this article it is highly likely your child has been in preschool or will attend preschool. But pick randomly any child from US population, you will find that only 50% chance the child goes to preschool.

The rest either stay home, where they play with parents or caregivers, or attend daycare, which may not have an educational component. Preschool isn’t mandatory, and in most places it’s not free. (Source : WSJ)

What is the observed difference in their later performance of those who attended preschool and those who didn’t?

According to Dr. Celia Ayala, research says preschool attendance points to stellar career.  She said,

“Those who go to preschool will go on to university, will have a graduate education, and their income level will radically improve,”

50% of children don’t get to attend preschool because of economic disparity. Seems only fair to democratize the opportunities for these children and provide them free preschool when their parents can’t afford them.

I do not have a stand on fairness but I have a position on the reported research and how they drew such a causation conclusion.

First I cannot make judgement on a research when someone simply says, “research says”, without producing the work, the data that went into it and the publication. Let us look at two possible ways the said research could have been conducted.

Cross-sectional Analysis – Grab a random sample of successful and unsuccessful adults and see if there is statistically significant difference in the number of those who attended preschool.  As a smart and analytically minded reader you can see the problem with cross-sectional studies. It cannot account for all different factors and confuses correlation with causation.

Longitudinal Analysis – This means studying over a period of time. Start with some preschoolers and some not in preschool and track their progress through education, college and career.  If there is statistically significant difference then you could say preschool helped. But you, the savvy reader, can see the same problems persist.  Most significantly it ignores the effect of parents – both their financial status and genes.
A parent who enrolls the child in preschool is more likely to be involved in every step of their growth. Even if you discount that, the child is simply lucky to start with smart parents.

So the research in essence is not actionable. Using it to divert resources to invest in providing preschool opportunity to those who cannot afford is not only misguided but also overlooks opportunity cost of the capital.

What if the resources could actually help shore up elementary, middle or high-school in low-income neighborhood? Or provide supplementary classes to those who are falling behind.

Failing to question the research, neglecting opportunity costs and blindly shoveling resources on moving a single metric will only result in moving the metric but with no tangible results.

Where do you stand?

Demand Validation – Don’t stop with what is easy, available and fits your hypothesis

As days get hotter, customers line up at Good Humor ice cream trucks only to be disappointed to find that their favorite ice cream, Toasted Almond Bar, is no more available. Truck after truck, customer after customer, similar story. Customers cannot believe the truck  does not any more carry their favorite product. (Full story here)

What is wrong with the business that does not know its own customers and their needs?

Why are they refusing to heed the validation they get from the ice cream trucks (their distribution channel) who are outside the corporate building and with the customers?

This is not because Unilever that owns the Good Humor brand is not customer centric but because it is looking at aggregate customer demand, not just handful of customer inputs. These anecdotes about disappointed customers are just that, anecdotes and do not provide demand validation.

One, two,…, hundred people walking up and demanding a product is not enough. When Unilever looks at its flavor mix, the hero of this story is actually the least popular, bringing in only 3% of the sales. Their data shows that the almond bar is popular only in Northeast especially among grown-ups (see footnote on segmentation).

Talking to handful of grownups from Northeast, just because these were the only ones available (like talking to few people in Coupa cafe in Palo Alto) is not demand validation.  These anecdotes can only help you frame better hypothesis about about customer needs and not proof for the hypothesis itself.

Even if you were to pick 100 grownups from Northeast (good enough sample size that will provide 95% confident answer at 10% margin of error),  you are going to end up with wrong answer about your customers. (Because you are not doing random sampling from your entire target segment.)

When it comes to demand validation do get out of the building. But when you return don’t go building almond bars because a few grownups in your Northeast neighborhood (or others at a boot-camp ) said so. You have some serious market analysis work to do.

Note on Segmentation: ‘Grownups in Northeast’ is not a segment. This is a measure of their customer mix. We still do not know why these people love this specific flavor.

It is likely better to speak in absolutes

You read only interesting findings because only those get published, get written about and popularized in social media. Experiments that find no statistically significant difference don’t leave the filing cabinets of researchers because no one wants to read a story where nothing happens. This is such an experiment, where there was not enough evidence to reject the null hypothesis.

Let us start at the beginning. This experiment is about people’s perception of a person’s competence based on whether the person speaks in absolutes with no room for alternatives or whether the person speaks in terms of likelihood, accounting for alternative explanations.

There are several examples of those who speak in absolutes with no self-doubt. Read any CEO interview (enterprise or startup), management guru’s book or Seth Godin’s blog. Examples are,

“Revenue grew because of our marketing”
“Sales fell because of Europe”
“Groupon works, it really works”

An example of speaking in terms of likelihood comes from Nobel laureates in economics,

“Answers to questions like that require careful thinking and a lot of data analysis. The answers are not likely to be simple.”

Hypotheses: You do start with hypotheses before any data analysis don’t you?

Here are the hypotheses I had about speaking in absolutes/likelihoods and perception of competence.

H1: Business leaders are judged to be more competent when they speak in absolutes. Conversely, using terms like “likely” may be perceived as wishy-washy and hence signal incompetence.

H2: Scientists are judged to be more competent when they use likelihoods and avoid absolutes. (Because scientists are expected to think about all aspects and anyone who zones in on one factor must not know how to think about acenarios)

Of course the null hypothesis is there is no statistically significant difference in perception of competence based on whether the subject in question speaks in absolutes or in likelihoods.

Experiment Design: So I designed a simple 2X2 experiment, using SurveyGizmo. You can see the four groups, Company Executive and Scientist as one dimension, Absolutes and Likelihoods on the other. I designed a set of 4 statements with these combinations. When people clicked on the survey they were randomly shown one of the four options.

Here is one of the four statements

This was a very generic statement meant to speak about results and what could have caused it. I avoided specific statements because people’s domain knowledge and preconceived notions come into play. For example, if I had used a statement about lean startup or social media it would have resulted in significant bias in people’s answers.

Based on just one statement, without context, people were asked to rate the competence of the person. Some saw this about Scientists, some about a Company Executive.

Note that an alternate design is to show both Absolute and Likelihood statement and ask the respondents to pick the one they believe to be more competent. I believe that would lead to experimental bias as people may start to interpret the difference between two statements.

Results:  I collected  130 responses, almost evenly split between four groups and did t-test on the mean rating between the groups (Scientists: Absolute/Likelihood, Executive: Absolute/Likelihood, Absolute: Executive/Scientist, Likelihood: Executive/Scientist). And you likely guessed the results from my opening statements.

There is not enough evidence to reject the null hypothesis in all the different tests. That means and difference we see in competence perception of those speaking in absolutes and likelihoods is just random.

What does this mean to you?

Speaking in absolutes, a desired trait that leaders cultivate to be seen as competent and decisive leader, has no positive effect. Including uncertainties does not hurt either.

So go right ahead and present simplistic one size fits all solutions without self-doubt.  After all stopping to think about alternatives and uncertainties only takes time and hurts ones brain with no positive effect on the audience.

Caveats: While competence is not an issue I believe trust perception could be different. That requires another experiment.