Tag Archives: Hypothesis

Can you add my one question to your survey?

No sooner do you let it be known, mostly inadvertently, that you are about to send out a survey to customers than the incessant requests (and commands) start coming from your co-workers (and bosses) to add just one more question to it. Just one more question they have been dying to find the answer to but never got around to running a survey, or doing anything else, to answer.

Just one question, right? What harm can it do? Surely you are not opening the floodgates and adding everyone’s question, just one question to satisfy the HiPPO (the highest paid person’s opinion)?

Maybe I am being unfair to our colleagues. It is possible that it is not them asking to add one more question; often it is we who are tempted to add just one more question to the survey we are about to send out. If survey takers are already answering a few, it can’t be that bad for them to answer one more, can it?

The answer is yes, of course it can be really bad. Resist any arm-twisting, bribing and your own temptation to add that one extra question to a carefully constructed survey. That is, assuming you did carefully construct the survey. If not, sure, add them all; the answers will be meaningless and unactionable anyway.

To define what a carefully constructed survey means we need to ask, “What decision are you trying to make with the data you will collect?”

If you do not have decisions to make, if you won’t do anything different based on the data collected, or if you are committed to doing whatever you are doing now and are only collecting data to scratch an itch, then you are doing it absolutely wrong. In that case, yes, please add that extra question from your boss for some brownie points.

So you do have decisions to make, and you have made sure the data you seek is not available through any other channel. Then you need to develop a few hypotheses about the decision. You do that through background exploratory research, including one-on-one customer interviews, social media search analysis and, if possible, focus groups. Yes, we are actually paid to make better hypotheses, so you should take this step seriously.

For example, your decision is how to price a software offering and your hypotheses are about the value perception of certain key features and consumption models.

Once you develop a minimal set of well-defined hypotheses to test, you design the survey to collect data to test those hypotheses. Every question in your survey must serve to test one or more of the hypotheses. On the flip side, you may not be able to test all your hypotheses in one survey, and that is okay. But if there is a question that does not serve to test any of the hypotheses, then it does not belong in that survey.

The last step is deciding the relevant target mailing list for the survey. After all, there is no point in asking the right questions of the wrong people.

Now you can see what adding that one extra question from your colleague does to your survey. It did not come from your decision process, does not help test your hypotheses, and is most likely not relevant to the sample set you are using.

Estimate the amount of cash in Brad Pitt’s wallet

This isn’t original; I read it somewhere. Nevertheless, it serves to explain estimation, confidence intervals and precision.

Suppose I asked you to estimate how much cash Brad Pitt carries in his wallet. What would be your guess?

It is hard to guess right. It is hard because I asked you for a single number, and given that there are many possibilities (even with just whole numbers) your answer is likely going to be wrong. With a single-value guess you cannot tell how confident you are about the estimate. That is the problem with making single-value estimates, be it the cash in a wallet or the expected revenue impact of a marketing campaign. Don’t give a single number and don’t trust anyone giving you a single number.

What if we asked 1000 random people on the street, collected all their answers and averaged them? Would that give the right answer? Isn’t that the wisdom of the crowd? Well, it won’t be the right answer. But if you plot the answers against the number of people who gave each value (a histogram), you will likely see a Normal curve. The distribution will tell us the low and high values of the cash in Brad Pitt’s wallet and also the chance that it will be outside this range.

Suppose 95% of the responses fall between $10 and $978. Then we could say, “we are 95% confident Brad has $10 to $978”. We could still be wrong in saying “we are 95% confident” if we received homogeneous answers and hence got too narrow a range.
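Here is a minimal sketch of reading such a 95% range off a pile of crowd guesses. The guesses are simulated (a made-up lognormal spread, not real street answers), and `central_range` is a hypothetical helper name, so treat the printed numbers as illustration only.

```python
import random
import statistics

def central_range(values, coverage=0.95):
    """Return (low, high) bounds that contain roughly `coverage` of the values."""
    # quantiles(n=200) returns 199 cut points at 0.5%, 1.0%, ..., 99.5%
    cuts = statistics.quantiles(values, n=200, method="inclusive")
    tail = (1.0 - coverage) / 2.0                # 2.5% in each tail for 95%
    low_index = round(tail * 200) - 1            # cut point at the 2.5% mark
    high_index = round((1.0 - tail) * 200) - 1   # cut point at the 97.5% mark
    return cuts[low_index], cuts[high_index]

# Simulated guesses from 1000 passers-by; the distribution is an assumption.
rng = random.Random(42)
guesses = [round(rng.lognormvariate(5.0, 1.0)) for _ in range(1000)]

low, high = central_range(guesses)
inside = sum(low <= g <= high for g in guesses) / len(guesses)
print(f"95% of guesses fall between ${low:,.0f} and ${high:,.0f}")
print(f"share of guesses inside that range: {inside:.1%}")
```

The range describes the spread of the guesses, not the truth about the wallet. If everyone guesses alike, the range shrinks without anyone knowing more.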

Instead of asking 1000 people what if I asked you not for a single value but to give your 95% confidence interval for the amount of cash in his wallet, what would be your answer?

It is the equivalent of asking 1000 different people. And since I asked for 95% confidence, you should give a range such that there is only a 5% chance the real answer is outside it. I am not asking for precision (so don’t try to give a narrow range) but for a high confidence level (so you should go wide).

Since you do not know anything about the cash-carrying habits of Hollywood stars, you should trade off precision for confidence. You could answer $0 to $100,000. That is acceptable but too wide a range to be of real use in cases other than estimating Brad Pitt’s wallet. However, you can apply your knowledge about stuffing bills in a wallet and give a better range, like $10 to $2000.

That is what you would do when measuring outcomes of events when there are many unknowns. You break down the BIG unknown into a set of component unknowns, and for each smaller unknown you make an estimate at a given confidence level. Stating a range with a confidence level (a confidence interval) based on prior knowledge is far better and more usable than a single number that we are asked to trust.
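One way to turn component ranges into a range for the big unknown is simulation. The sketch below is only an illustration: the three components, their 90% ranges, the choice to treat each range as a normal distribution and the multiplication of the three are all assumptions made up for the example.

```python
import random
import statistics

# Hypothetical 90% ranges for three component unknowns of a campaign.
# Every number here is made up for illustration.
COMPONENT_RANGES = {
    "people_reached": (20_000, 60_000),
    "conversion_rate": (0.005, 0.02),
    "revenue_per_order": (30.0, 80.0),
}

Z_90 = 1.645  # a 90% interval spans about +/-1.645 standard deviations

def draw(low, high, rng):
    """Draw one value, treating (low, high) as the 90% range of a normal."""
    mean = (low + high) / 2.0
    sigma = (high - low) / (2.0 * Z_90)
    return max(0.0, rng.gauss(mean, sigma))  # clamp: none of these can be negative

rng = random.Random(7)
totals = []
for _ in range(10_000):
    total = 1.0
    for low, high in COMPONENT_RANGES.values():
        total *= draw(low, high, rng)
    totals.append(total)

cuts = statistics.quantiles(totals, n=20)  # cut points at 5%, 10%, ..., 95%
low_90, high_90 = cuts[0], cuts[-1]
print(f"90% range for campaign revenue: ${low_90:,.0f} to ${high_90:,.0f}")
```

The output is a range with a stated confidence level, which is the form of answer argued for here.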

“The marketing campaign will increase sales by 45%”

“We are 90% confident the marketing campaign will result in sales increase in the range of 20% to 43%”

Which one of these two is more trustworthy?

 

It is likely better to speak in absolutes

You read only interesting findings because only those get published, get written about and popularized in social media. Experiments that find no statistically significant difference don’t leave the filing cabinets of researchers because no one wants to read a story where nothing happens. This is such an experiment, where there was not enough evidence to reject the null hypothesis.

Let us start at the beginning. This experiment is about people’s perception of a person’s competence based on whether the person speaks in absolutes with no room for alternatives or whether the person speaks in terms of likelihood, accounting for alternative explanations.

There are several examples of those who speak in absolutes with no self-doubt. Read any CEO interview (enterprise or startup), management guru’s book or Seth Godin’s blog. Examples are,

“Revenue grew because of our marketing”
“Sales fell because of Europe”
“Groupon works, it really works”

An example of speaking in terms of likelihood comes from Nobel laureates in economics,

“Answers to questions like that require careful thinking and a lot of data analysis. The answers are not likely to be simple.”

Hypotheses: You do start with hypotheses before any data analysis don’t you?

Here are the hypotheses I had about speaking in absolutes/likelihoods and perception of competence.

H1: Business leaders are judged to be more competent when they speak in absolutes. Conversely, using terms like “likely” may be perceived as wishy-washy and hence signal incompetence.

H2: Scientists are judged to be more competent when they use likelihoods and avoid absolutes. (Because scientists are expected to think about all aspects, and anyone who zeroes in on one factor must not know how to think about scenarios.)

Of course the null hypothesis is that there is no statistically significant difference in perception of competence based on whether the subject in question speaks in absolutes or in likelihoods.

Experiment Design: So I designed a simple 2X2 experiment, using SurveyGizmo. You can see the four groups, Company Executive and Scientist as one dimension, Absolutes and Likelihoods on the other. I designed a set of 4 statements with these combinations. When people clicked on the survey they were randomly shown one of the four options.

Here is one of the four statements

This was a very generic statement meant to speak about results and what could have caused them. I avoided specific statements because people’s domain knowledge and preconceived notions come into play. For example, if I had used a statement about lean startup or social media, it would have resulted in significant bias in people’s answers.

Based on just one statement, without context, people were asked to rate the competence of the person. Some saw this about Scientists, some about a Company Executive.

Note that an alternate design is to show both the Absolute and Likelihood statements and ask the respondents to pick the one they believe to be more competent. I believe that would lead to experimental bias, as people may start to interpret the difference between the two statements.

Results: I collected 130 responses, almost evenly split between the four groups, and ran t-tests on the mean rating between pairs of groups (Scientists: Absolute/Likelihood, Executive: Absolute/Likelihood, Absolute: Executive/Scientist, Likelihood: Executive/Scientist). You likely guessed the results from my opening statements.

There is not enough evidence to reject the null hypothesis in any of the tests. That means any difference we see in competence perception between those speaking in absolutes and those speaking in likelihoods cannot be distinguished from random variation.
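For readers who want to try this kind of comparison, here is a minimal sketch using only the Python standard library. It swaps the t-test for a permutation test on the difference in mean rating, which asks the same question (could a gap this big arise by chance?) without needing a t-distribution table. The ratings are simulated on an assumed 1 to 5 scale; they are not the survey's actual responses, and `permutation_p_value` is a hypothetical helper name.

```python
import random
import statistics

def permutation_p_value(group_a, group_b, n_shuffles=5000, seed=0):
    """Two-sided p-value for the difference in mean rating between two groups."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(group_a) - statistics.fmean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # reassign ratings to groups at random
        diff = abs(statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:]))
        if diff >= observed - 1e-12:
            at_least_as_extreme += 1
    # add-one correction so the estimate is never exactly zero
    return (at_least_as_extreme + 1) / (n_shuffles + 1)

# Simulated 1-5 competence ratings, NOT the survey's actual responses.
rng = random.Random(1)
executive_absolute = [rng.choice([2, 3, 3, 4, 4, 5]) for _ in range(33)]
executive_likelihood = [rng.choice([2, 3, 3, 4, 4, 5]) for _ in range(32)]

p_value = permutation_p_value(executive_absolute, executive_likelihood)
print(f"p-value: {p_value:.3f}")
if p_value < 0.05:
    print("Reject the null hypothesis at the 5% level")
else:
    print("Not enough evidence to reject the null hypothesis")
```

The same function can be run on each of the four pairings listed in the results.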

What does this mean to you?

Speaking in absolutes, a trait that leaders cultivate in order to be seen as competent and decisive, has no measurable positive effect on perceived competence. Including uncertainties does not hurt either.

So go right ahead and present simplistic one-size-fits-all solutions without self-doubt. After all, stopping to think about alternatives and uncertainties only takes time and hurts one’s brain, with no positive effect on the audience.

Caveats: While competence is not an issue, I believe trust perception could be different. That requires another experiment.

4 Ways You Can Put Google Customer Surveys To Work Today

As I previously wrote, Google Customer Surveys is a true business model innovation. It helps publishers unlock value from their digital assets and enables market researchers to reach new audiences they otherwise would not have found. I expressed my reservations about their positioning in my previous article:

But I do not get what they mean by, “look for correlations between questions” and definitely don’t get, “pull out hypotheses”. It is us, the decision makers, who make the hypotheses in hypothesis testing. We are paid to make better hypotheses that are worthy of testing.

Since I wrote that article, their Product Manager has emailed to say they removed their statement on “pull out hypotheses”.

This is a limited tool, with the ability to ask just one question and no way to ensure that the same user will answer multiple questions for customer-level analysis.

There is one more item which is their minimum sample size. You cannot order anything less than 1000 samples.

Despite these reservations I see Google Customer Surveys as an effective tool for product/brand managers, researchers and small businesses for these purposes:

1. Aided Recall: Present them a choice of different brands and ask them how many of these they recognize.
When you are trying to get very quick, high-level data on customer awareness of or preference for your brand, this is a great tool. The results are especially actionable when you get extreme results, like no one knowing about you.
If you are trying to find which brand they recognize the most, you can do that as well with a different question type. However, due to its question format limitation, Google Customer Surveys cannot help with unaided recall.

2. Finding Consideration Set: Present them a choice of different brands and ask them how many they would consider buying to solve a particular need. This is similar to Aided Recall but the question is more focused. You are not simply asking about awareness but whether your brand makes it into their consideration set.

3. Brand Association: Present them an image or a statement and ask them to pick a tag-line or brand they believe goes with it. Another variation of this question is asking them to associate your brand with an unrelated field. A typical example is, “if our brand were a movie actor, who will it be”.

The ability to use images is a very powerful feature. It creates many different opportunities, for example testing your advertising copy or the images you use in your collateral. It is better to poll your audience on whether the image you used looks more like a bean bag or a boxing glove before you launch your expensive advertising campaign.

4. Consumer Behavior Research: This is a whole class of hypothesis testing you can do with Google Customer Surveys. While it is not a tool for A/B split testing, you can use it to test your hypotheses on customer preferences or their susceptibility to anchors and other nudges. Before collecting results you need to specify a reasonable hypothesis that is worth testing. When you collect data you can test for statistical significance using a Chi-square test to validate your hypothesis. Do keep in mind that sometimes data can fit more than one hypothesis.
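As a rough sketch of the Chi-square step mentioned in the last item, suppose half the respondents see a question with a high-priced anchor option and half see it without. The counts below are hypothetical, `chi_square_2x2` is a made-up helper name, and the shortcut used for the p-value only holds for a 2x2 table (one degree of freedom).

```python
import math

def chi_square_2x2(table):
    """Chi-square test of independence for a 2x2 table of counts.

    Returns (statistic, p_value) with one degree of freedom and no
    continuity correction.
    """
    (a, b), (c, d) = table
    total = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    # For one degree of freedom the upper tail is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value

# Hypothetical counts: rows are the question variant shown,
# columns are (picked premium plan, picked basic plan).
responses = [
    [290, 210],  # variant with a high-priced anchor option
    [240, 260],  # variant without the anchor
]
statistic, p_value = chi_square_2x2(responses)
print(f"chi-square = {statistic:.2f}, p = {p_value:.4f}")
```

A small p-value says the choice depended on which variant was shown. It does not say why, which is where the hypothesis you wrote down beforehand comes in.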

There is however a big limitation because of the length of questions you can ask (as you see in the third option in the image on the left).

There you have it. A tool with limitations, but one that is effective in specific areas. It opens up new ways to collect data and test where none existed before.

A corollary to this post would be cases where you should not use this tool. That includes finding the price customers are willing to pay or asking them how important a single feature is. You will have to wait for another post for the reasons.

Yet Another Causation Confusion – School Ratings and Home Foreclosures

Here is the headline from WSJ,

One Antidote to Foreclosures: Good Schools

The article is based on data mining by Location Inc, which found that

over the past six months the percentage of foreclosure (or “real-estate-owned”) sales went down as the school ranking went up in five metro areas

Next we see a news story attributing clear causation. This one is hard for some to notice because it appears to be longitudinal, reported over a six-month period. If it were a simpler cross-sectional study, most would catch the fallacy right away.

First let me point out this is the problem with data mining – digging for nuggets in mountains of Big Data without an initial hypothesis and finding such causations.

Second, school rating improvement could be due to random factors that coincide with lower foreclosures.

Third, even though the longitudinal aspect seems to imply causation, there are many omitted variables here: the common factors that are driving down foreclosures and driving up school ratings.

School rating is not an independent metric. It relies not just on teacher performance but also on parents. The same people who are willing to work with their kids are also likely to be fiscally responsible. Another controversial but proven factor is the effect of genes on children’s performance.

If we ignore all this and focus our resources on improving school ratings to solve the foreclosure crisis, we will be making loud noises to chase away the wolves that cause eclipses.

A Closer Look at A/B testing

Suppose you have a service – be it a web based service or a brick and mortar service.  Visitors walk through the front door. Most just leave without taking an action that is favorable to you. Some do and become converts.

As you function along, you form a belief/gut-feel/hypothesis that the color of the door affects how many will convert. Specifically, a certain color will improve conversion. (I am color blind, else I would call out the colors I used below.)

You verify this hypothesis by running a split test. You evenly split your visitor stream, randomly sending them through Door A of current color or Door B of the new color which is supposed to increase conversion. This is the A/B split test.

How do you verify your hypothesis?

The most common way that is practiced by every A/B test tool in the market is shown below

These tools keep both Converts and Non-Converts for a given Door together and treat each Door as a separate population. Those who went through Door A (both Converts and Non-Converts) are kept separate from those who went through Door B. They test the hypothesis that the proportion of converts in the Door B population is higher than the proportion of converts in the Door A population. The tools assume that the population data are normally distributed and use a 2-sample t-test to verify that the difference between the two proportions is statistically significant.
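To make the tools' approach concrete, here is a minimal sketch of comparing two conversion proportions. It uses a two-proportion z-test with a pooled standard error, the usual large-sample form of this comparison, rather than a literal t-test. The visitor and convert counts are hypothetical and `two_proportion_z_test` is a made-up helper name, not any vendor's API.

```python
import math
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """One-sided test of H1: the conversion rate of B is higher than that of A."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Under the null both doors share one conversion rate, so pool the data.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    z = (p_b - p_a) / std_err
    p_value = 1.0 - NormalDist().cdf(z)  # upper tail only: B better than A
    return z, p_value

# Hypothetical traffic: 5000 visitors through each door.
z, p_value = two_proportion_z_test(conv_a=200, n_a=5000, conv_b=250, n_b=5000)
print(f"z = {z:.2f}, one-sided p = {p_value:.4f}")
```

Note that the Non-Converts enter the calculation through `n_a` and `n_b`, which is the modeling choice questioned next.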

What is wrong with this approach? For starters, you can see how it rewrites the hypothesis and re-wires the model. This approach treats conversion as an attribute of the visitor. This is using the t-test for the wrong purpose or using the wrong statistical test for A/B testing.

For example, if you want to test whether there is a higher prevalence of heart disease among Indians living in the US vs. India, you will draw random samples from the two populations, measure the proportion of heart disease in each sample and do a t-test to see if the difference is statistically significant. That is a valid use of the t-test for population proportions.

Conversion is not the same as measuring the proportion of a population characteristic like heart disease. Treating the conversion rate as a characteristic of the visitor is contrived. This approach also forces you to keep the Converts and Non-Converts together, when you only need to look at those who converted.

Is there another way?

Yes. Take a look at this model that closely aligns with the normal flow. We really do not care about the Non-Converts and we test the correct hypothesis that more Converts came through Door B than through Door A.

This method grabs a random sample of Converts and tests whether more of them came through Door B than through Door A. It uses a Chi-square test to verify that the difference is not just due to randomness. It needs no other assumptions, such as a normal distribution, and it tests the right hypothesis. Most importantly, it fits the flow and the model we had before we introduced Door B.
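A minimal sketch of that converts-only test follows. It assumes traffic was split evenly between the doors, so under the null hypothesis a convert is equally likely to have come through either one. The counts are hypothetical and `chi_square_equal_split` is a made-up helper name.

```python
import math

def chi_square_equal_split(count_a, count_b):
    """Chi-square goodness-of-fit test against a 50/50 split of converts.

    Returns (statistic, p_value) with one degree of freedom.
    """
    expected = (count_a + count_b) / 2.0
    statistic = ((count_a - expected) ** 2 + (count_b - expected) ** 2) / expected
    # For one degree of freedom the upper tail is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value

# Hypothetical sample of 450 converts, tagged by the door they came through.
converts_door_a, converts_door_b = 200, 250
statistic, p_value = chi_square_equal_split(converts_door_a, converts_door_b)
print(f"chi-square = {statistic:.2f}, p = {p_value:.4f}")
if p_value < 0.05 and converts_door_b > converts_door_a:
    print("More converts came through Door B than chance would explain")
```

The Chi-square statistic does not know which door is ahead, so the direction is checked separately in the last step.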

Want to know more? Want to know the implications of this and how you can influence your A/B test tool vendors to change?  Drop me a note.