It is likely better to speak in absolutes

You read only interesting findings because only those get published, written about, and popularized on social media. Experiments that find no statistically significant difference never leave researchers' filing cabinets because no one wants to read a story where nothing happens. This is such an experiment: one where there was not enough evidence to reject the null hypothesis.

Let us start at the beginning. This experiment is about people's perception of a person's competence based on whether that person speaks in absolutes, with no room for alternatives, or in terms of likelihood, accounting for alternative explanations.

There are several examples of those who speak in absolutes with no self-doubt. Read any CEO interview (enterprise or startup), any management guru's book, or Seth Godin's blog. Some examples:

“Revenue grew because of our marketing”
“Sales fell because of Europe”
“Groupon works, it really works”

An example of speaking in terms of likelihood comes from Nobel laureates in economics,

“Answers to questions like that require careful thinking and a lot of data analysis. The answers are not likely to be simple.”

Hypotheses: You do start with hypotheses before any data analysis, don't you?

Here are the hypotheses I had about speaking in absolutes/likelihoods and perception of competence.

H1: Business leaders are judged to be more competent when they speak in absolutes. Conversely, using terms like “likely” may be perceived as wishy-washy and hence signal incompetence.

H2: Scientists are judged to be more competent when they use likelihoods and avoid absolutes. (Because scientists are expected to think about all aspects, anyone who zeroes in on one factor must not know how to think about alternative scenarios.)

Of course, the null hypothesis is that there is no statistically significant difference in perception of competence based on whether the subject in question speaks in absolutes or in likelihoods.

Experiment Design: I designed a simple 2×2 experiment using SurveyGizmo. You can see the four groups: Company Executive and Scientist on one dimension, Absolutes and Likelihoods on the other. I designed a set of four statements with these combinations. When people clicked on the survey they were randomly shown one of the four options.
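The random assignment can be sketched in a few lines. This is only an illustration of the design, not what SurveyGizmo actually runs internally; the cell labels are mine:

```python
import random

# The four cells of the 2x2 design: role crossed with statement style.
CELLS = [
    ("Company Executive", "Absolute"),
    ("Company Executive", "Likelihood"),
    ("Scientist", "Absolute"),
    ("Scientist", "Likelihood"),
]

def assign_cell(rng: random.Random) -> tuple:
    """Show each incoming respondent one of the four statements,
    chosen uniformly at random, so groups end up roughly even."""
    return rng.choice(CELLS)
```

With enough respondents, uniform random assignment leaves each cell with about a quarter of the traffic, which is what makes the between-group comparison fair.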

Here is one of the four statements

This was a deliberately generic statement about results and what could have caused them. I avoided specific statements because people's domain knowledge and preconceived notions would come into play. For example, if I had used a statement about lean startup or social media, it would have introduced significant bias into people's answers.

Based on just this one statement, without context, people were asked to rate the competence of the person. Some saw the statement attributed to a Scientist, others to a Company Executive.

Note that an alternative design is to show both the Absolute and the Likelihood statements and ask respondents to pick the one they believe shows more competence. I believe that would introduce experimental bias, as people may start to interpret the difference between the two statements.

Results: I collected 130 responses, almost evenly split between the four groups, and ran t-tests on the mean ratings between the groups (Scientist: Absolute vs. Likelihood; Executive: Absolute vs. Likelihood; Absolute: Executive vs. Scientist; Likelihood: Executive vs. Scientist). And you likely guessed the results from my opening statements.
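The group comparison can be reproduced with a short script. This is a sketch, not the original analysis: the ratings below are made-up placeholders, and I use Welch's t statistic with a normal approximation to the p-value, which is adequate for group sizes around 30:

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's two-sample t statistic (does not assume equal variances)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

def two_sided_p(t):
    """Two-sided p-value via the normal approximation; fine for df > ~30."""
    return math.erfc(abs(t) / math.sqrt(2))

# Hypothetical competence ratings (1-5) for two of the four groups.
exec_absolute   = [4, 3, 5, 4, 4, 3, 4, 5, 3, 4]
exec_likelihood = [3, 4, 4, 5, 3, 4, 4, 3, 5, 4]

t = welch_t(exec_absolute, exec_likelihood)
p = two_sided_p(t)
if p >= 0.05:
    print("Cannot reject the null: the difference may be pure chance.")
```

A dedicated stats package would use the exact t distribution, but the decision rule is the same: compare the computed p-value against the 0.05 threshold.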

There was not enough evidence to reject the null hypothesis in any of the tests. That means any difference we see in the competence perception of those speaking in absolutes versus likelihoods is indistinguishable from chance.

What does this mean to you?

Speaking in absolutes, a trait leaders cultivate in order to be seen as competent and decisive, showed no positive effect. Including uncertainties did not hurt either.

So go right ahead and present simplistic, one-size-fits-all solutions without self-doubt. After all, stopping to think about alternatives and uncertainties only takes time and hurts one's brain, with no positive effect on the audience.

Caveats: While perceived competence showed no difference, perceived trust could behave differently. That requires another experiment.

Significance of Random Flukes

One sure way to kill a joke is to explain it. I hate to kill this great and clever joke on statistical significance, but here goes. Maybe you want to just read the joke, love it, treasure it, and move on without reading the rest of the article.

Love it? I love this one for its simple elegance. You can leave now if you do not want to see this dissected.

First the good things.

The “scientists” start with hypotheses external to the data source, collect data, and test for statistical significance. They likely used a one-tailed t-test and ran a between-groups experiment.

One group was the control group, which did not eat the jelly beans. The other was the treatment group, which was given the jelly beans.

The null hypothesis H0 is, “Any observed difference in the number of occurrences of acne between the two groups is just due to coincidence.”

The alternative hypothesis H1 is, “The differences are statistically significant. The jelly beans made a difference”.

They use a p-value threshold of 0.05 (a 95% confidence level). A p-value of 0.05 means that, if the null hypothesis were true, there would be only a 5% probability of seeing a result at least as extreme as the one observed. If the computed p-value is less than 0.05 (p < 0.05), they reject H0 and accept H1. If the computed p-value is greater than 0.05 (p > 0.05), H0 cannot be rejected: the result is consistent with pure chance.

They run a total of 21 experiments.

The first is the aggregate. They likely used a large jar of mixed-color jelly beans, ran the test, and found no compelling evidence to reject the null hypothesis that any difference was just coincidence.

Then they run 20 more experiments, one for each color. In 19 of them (19 different colors) they cannot rule out coincidence. But in one experiment, using green jelly beans, they find p less than 0.05. They reject H0 and accept H1: green jelly beans made a difference in the number of occurrences of acne.

In 20 of the 21 experiments (95.24%), the results were not significant enough to toss out coincidence as the explanation. In 1 of the 21 (4.76%), they were, and hence green was linked to acne.

In other words, at a 0.05 significance level you should expect roughly 1 in 20 purely random experiments to come up “significant” by chance alone. One green hit out of 20 color tests is exactly what chance predicts.
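The green-jelly-bean trap is easy to reproduce in simulation. In the sketch below (my own illustration, not the comic's data), control and treatment are drawn from the same distribution, so every “significant” result is, by construction, a false positive:

```python
import math
import random
from statistics import mean, variance

def null_experiment(rng, n=50):
    """One null experiment: control and treatment drawn from the SAME
    distribution, so any 'effect' this test detects is pure chance.
    Returns a two-sided p-value (Welch t, normal approximation)."""
    control = [rng.gauss(0, 1) for _ in range(n)]
    treated = [rng.gauss(0, 1) for _ in range(n)]
    se = math.sqrt(variance(control) / n + variance(treated) / n)
    t = (mean(treated) - mean(control)) / se
    return math.erfc(abs(t) / math.sqrt(2))

rng = random.Random(20)
p_values = [null_experiment(rng) for _ in range(20)]  # one per color
hits = sum(p < 0.05 for p in p_values)
print(f"{hits} of 20 null experiments came out 'significant'")
```

Run enough batches of 20 and, on average, about one test per batch clears the 0.05 bar despite there being nothing to find.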

However, the media reports, “Green jelly beans linked to acne at 95% confidence level,” because that one experiment found p < 0.05.

Green is the spurious variable. The fact that the green experiment had p < 0.05 could easily be because that particular run happened to catch a random fluke.

That is exactly what significance testing on random samples allows for: about one test in twenty will clear the 0.05 bar by chance alone.

If we had not seen the first experiment or the 19 others with p > 0.05, we would be tempted to accept the link between green jelly beans and acne. Since we saw all the negative results, we know better.

In reality, we don't see most, if not all, of the negative findings. Only the positive results get written up, be it the results of an A/B test that magically increased conversion or scientific research. After all, it is not interesting to read how changing just one word had no effect on conversion rates. Such negative findings deserve their place in the round filing cabinet.

By discarding all negative findings and publishing only positive ones, the experimenters violate the rule of random sampling and present a concentration of random flukes as a breakthrough.

The next step down this slippery slope of pseudo-statistical testing is Data Dredging. Here one skips initial hypotheses altogether and simply dives into the data looking for “interesting correlations”: slicing up the data along every possible dimension to find something, anything.

For example, “Eating green jelly beans with the left hand while standing up on Tuesdays” causes acne.
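Data Dredging is easy to demonstrate on data that is random by construction. In this sketch (the features and the 30% base rate are made up), no variable has any real effect on “acne”, yet scanning every slice reliably turns up at least one subgroup whose rate looks far from the baseline:

```python
import itertools
import random

rng = random.Random(7)
COLORS = ["red", "green", "blue", "purple"]
HANDS = ["left", "right"]
DAYS = ["Mon", "Tue", "Wed"]

# Purely random data: acne hits 30% of people regardless of anything else.
people = [
    {
        "color": rng.choice(COLORS),
        "hand": rng.choice(HANDS),
        "day": rng.choice(DAYS),
        "acne": rng.random() < 0.30,
    }
    for _ in range(2000)
]

# Dredge: compute the acne rate in every (color, hand, day) slice ...
rates = {}
for combo in itertools.product(COLORS, HANDS, DAYS):
    grp = [p for p in people
           if (p["color"], p["hand"], p["day"]) == combo]
    if grp:
        rates[combo] = sum(p["acne"] for p in grp) / len(grp)

# ... and report only the most extreme one, as a dredger would.
headline = max(rates, key=lambda c: abs(rates[c] - 0.30))
print(headline, f"acne rate: {rates[headline]:.0%}")
```

With 24 slices of roughly 80 people each, the most extreme slice almost always sits well away from the true 30% rate, purely through sampling noise. That extreme slice is the headline; the other 23 go in the filing cabinet.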

If you find this claim so ridiculous that you would never fall for it, consider all the articles you have seen about the best days to run email marketing, the best days to tweet, or how to do marketing like certain brands.

Can you spot the fact that these are based on Data Dredging?

(See here for a great article on Data Dredging.)

In this age of instant publication, easy experimentation, Big Data, and the social media echo chamber, how can you spot and steer clear of Random Flukes reported as scientific research?

You can start with this framework:

  1. What are the initial hypotheses before collecting data? If there are none, thanks but no thanks. (Data Dredging.)
  2. How were these hypotheses arrived at? If they were derived from the very data they are tested with, keep moving. A great example of this is the class of books like “7 Habits of …”
  3. A study with an extremely large sample? With enough data, even differences far too trivial to matter become statistically significant; that is a mathematical artifact of sample size, not a meaningful effect. Again, thanks but no thanks.
  4. A very narrow finding? It is the green jelly bean again: ask about the other dimensions that were tested.
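The large-sample point in item 3 deserves a demonstration. With very large samples, even a shift far too small to matter in practice sails past the 0.05 bar, so “significant at n = 200,000” says nothing about whether an effect is worth acting on. A sketch, using made-up data and a normal-approximation p-value:

```python
import math
import random

def two_sided_p(a, b):
    """Normal-approximation p-value for a difference in means
    (Welch-style standard error; fine for large samples)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    se = math.sqrt(va / na + vb / nb)
    return math.erfc(abs(ma - mb) / se / math.sqrt(2))

rng = random.Random(3)
n = 200_000
baseline = [rng.gauss(0.00, 1.0) for _ in range(n)]
# A +0.02 shift: 2% of one standard deviation -- practically nothing.
variant = [rng.gauss(0.02, 1.0) for _ in range(n)]

p = two_sided_p(baseline, variant)
print(f"p = {p:.2e}: 'significant', yet the effect is negligible")
```

The same +0.02 shift with 30 people per group would almost never register. Statistical significance answers “is this distinguishable from noise?”, not “is this big enough to care about?”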

Or you can just plain ignore all these nonsensical findings camouflaged as analytics.