## Pig or a Dog – Which is Smarter?: Metric, Data, Analytics and Errors

How do you determine which interview candidate to hire? How do you evaluate the candidate you decided you want to hire? (or decided you want to flush?)

How do you make a call on which group is performing better? How do you hold one group accountable for (or explain away) bad performance in a quarter compared with the other?

How do you determine future revenue potential of a target company you decided you want to acquire? (or decided you don’t want to acquire)?

What metrics do you use? What data do you collect? And how do you analyze that to make a call?

Here is a summary of an episode from Fetch! with Ruff Ruffman, the PBS Kids TV show:

Ruff’s peer, Blossom the cat, informs him that pigs are smarter than dogs. Not believing her and determined to prove her wrong, Ruff sends two very smart kids to test it. The two kids go to a farm with a dog and a pig. They decide that the time taken to traverse a maze is the metric they will use to determine who is smarter. They design three different mazes:

1. A really simple straight line (a very good choice, as this will serve as the baseline)
2. A maze with turns but no dead-ends (increasing complexity)
3. A maze with two dead-ends

Then they run three experiments, letting the animals traverse the maze one at a time and measuring the time for each run. The dog comes out ahead taking less than ten seconds in each case while the pig consistently takes more than a minute.

Let me interrupt here to say that the kids did not really want Ruff to win the argument. But the data seemed to show otherwise. So one of the kids changes the definition on the fly.

“Maybe we should re-run the third maze experiment. If the pig remembered the dead-ends and avoids them then it will show the pig is smarter because the pig is learning.”

And they do. The dog takes ~7 seconds compared to the 5.6 seconds it took in the first run. The pig does it in 35 seconds, half the time of its previous run.

They write up their results. The dog’s performance worsened while the pig’s improved. So the pig clearly showed learning and the dog didn’t. The pig indeed was smarter.

We are not here to critique the kids. This is not about them. This is about us, leaders, managers and marketers who have to make such calls in our jobs. The errors we make are not that different from the ones we see in the Pigs vs. Dogs study.

Are we even aware we are making such errors? Here are five errors to watch out for in our decision making:

1. Preconceived notion: There is a difference between a hypothesis you want to test vs. proving a preconceived notion.

A hypothesis is, “Dogs are smarter than pigs”. So is, “The social media campaign helped increase sales”.

A preconceived notion is, “Let us prove dogs are smarter than pigs”. So is, “let us prove that the viral video of man on horse helped increase sales”.

2. Using the right metric: What defines success and what better means must be defined in advance and should be relevant to the hypothesis you are testing.

Time to traverse a maze is a good metric, but is it the right one to determine which animal is smarter? Smart or not, dogs have an advantage over pigs – they respond to a trainer’s call and move in that direction. Pigs only respond to the presence of food. That seems unfair already.

Measuring the presence of a candidate may be good, but is that the right metric for the position you are hiring for? Measuring the number of views on your viral video is good, but is that relevant to performance?

It is usually a bad choice to pick a single metric. You need a basket of metrics that, taken together, point to which option is better.

3. Data collection: Are you collecting all the relevant data or just collecting what is convenient and available? If you want to prove Madagascar is San Diego then you will only look for white sandy beaches. If you stop after finding a single data point that fits your preconceived notion you will end up taking a \$9B write-down on that acquisition.

Was it enough to test one dog and one pig to make a general claim about dogs and pigs?

Was one run of each experiment enough to provide relevant data?

4. Changing definitions midstream: Once you decide on the hypothesis to test, the metrics and the experimental procedure, you should stick to them for the scope of the study and not change them when it appears the results won’t go your way.

There is nothing wrong with changing a definition, but you have to start over and be consistent.

5. Analytics errors: Can you make sweeping conclusions about performance without regard to variations?

Did the dog really worsen and the pig really improve, or was it simply regression to the mean?

Does the 49ers’ backup quarterback really have a hot hand that justifies benching Alex Smith? What you see as a sales jump from your social media campaign could easily be due to the usual variations in sales performance. Did you measure whether the performance uplift is beyond the usual variations by measuring against a comparable baseline?
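One rough way to ask whether an uplift is "beyond the usual variations" is to express it in units of the baseline's own standard deviation. The sketch below is only an illustration: the weekly sales figures are made up, the function name is hypothetical, and the cutoff of 2 standard deviations is a rule-of-thumb assumption, not something from the study above.

```python
from statistics import mean, stdev

def uplift_z_score(baseline, observed):
    """Return how many baseline standard deviations `observed` sits above the baseline mean."""
    return (observed - mean(baseline)) / stdev(baseline)

# Hypothetical weekly sales (units) for comparable weeks before the campaign.
baseline_weeks = [102, 95, 110, 99, 104, 91, 108, 97]
# Hypothetical sales for the week the campaign ran.
campaign_week = 109

z = uplift_z_score(baseline_weeks, campaign_week)
# Assumed rule of thumb: treat anything within 2 standard deviations as usual variation.
beyond_usual_variation = z > 2

print(f"z = {z:.2f}, beyond usual variation: {beyond_usual_variation}")
```

With these made-up numbers the campaign week looks like a jump over the average, yet it sits well within the spread the baseline weeks already show.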

How do you make decisions? How do you define your metrics, collect data and do your analysis?

Note: It appears from a different controlled experiment that pigs are indeed smarter. But if they are indeed so smart how did they end up as lunch?

## Testing 40 shades of blue – AB Testing

The title refers to the famous anecdote about Marissa Mayer testing 40 shades of blue to determine the right color for the links. (Unfortunately I am colorblind, I know just one blue.)

Mayer is famous for many things at Google, but the one that always sticks out – and defines her in some ways – is the “Forty Shades of Blue” episode.

she ordered that 40 different shades of blue would be randomly shown to each 2.5% of visitors; Google would note which colour earned more clicks. And that was how the blue colour you see in Google Mail and on the Google page was chosen.

Thousands of such tests happen in the web world, with every website running multiple experiments a day. Contrary to what most in webapp development may believe, AB testing does not have its origins in the webapp world. It is simply an application of statistical testing, the randomized controlled trial, to determine whether a ‘treatment’ made a difference in the performance of the treatment group compared to that of the control group.

The simplest test checks whether the observed difference between the two sample means is statistically significant. That means computing the p-value: the probability of seeing a difference at least this large if the treatment had no effect and only random variation were at play. If the p-value is less than a preset level, we declare the treatment made a difference.
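For click or conversion data the "sample mean" is just a rate, so the test can be sketched as a two-proportion z-test. This is a minimal illustration under stated assumptions: the visitor and conversion counts are invented, the function name is hypothetical, and the normal approximation is only reasonable for large samples.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates.

    Returns (z, p_value), using the pooled rate under the null hypothesis
    that both groups share the same underlying conversion rate.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided tail probability of a standard normal variable.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical experiment: control (A) vs. treatment (B), 5000 visitors each.
z, p_value = two_proportion_z_test(conv_a=200, n_a=5000, conv_b=250, n_b=5000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

The preset level (commonly 0.05) has to be chosen before looking at the data, for the same reason the kids should not have changed their maze metric midstream.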

Does it matter if the results are statistically significant? See here why it does not:

“I have published about 800 papers in peer-reviewed journals and every single one of them stands and falls with the p-value. And now here I find a p-value of 0.0001, and this is, to my way of thinking, a completely nonsensical relation.”

Should you test 40 shades of blue to find the one that produces most click-thrus or conversions? xkcd has the answer:
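The trouble with testing many variants at once is multiple comparisons: run enough tests and some will cross the significance threshold by chance alone. A back-of-the-envelope sketch, assuming 40 independent tests, no real difference between any of the shades, and the usual 0.05 level:

```python
alpha = 0.05        # per-test significance level
num_variants = 40   # one test per shade of blue (assumed independent)

# Probability that at least one test comes out "significant" purely by chance.
family_wise_error = 1 - (1 - alpha) ** num_variants

# Bonferroni correction: the stricter per-test level that keeps the overall
# false-positive rate at or below alpha.
bonferroni_alpha = alpha / num_variants

print(f"Chance of at least one false positive: {family_wise_error:.2f}")
print(f"Bonferroni-corrected per-test level: {bonferroni_alpha}")
```

Under these assumptions a "winning" shade is the expected result even when color makes no difference at all.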

Can Ms. Mayer test her way out of Yahoo’s current condition? Remember, all this split testing is about finding low-hanging fruit, not quantum leaps. And as Jim Manzi wrote in his book Uncontrolled,

Perhaps the single most important lesson I learned in commercial experimentation, and that I have since seen reinforced in one social science discipline after another, is that there is no magic. I mean this in a couple of senses. First, we are unlikely to discover some social intervention that is the moral equivalent of polio vaccine. There are probably very few such silver bullets out there to be found. And second, experimental science in these fields creates only marginal improvements. A failing company with a poor strategy cannot blindly experiment its way to success …

You can’t make up for poor strategy with incessant experimentation.

## My First GitHub Commit – Select Random Tweets

I have seen several reports that collect and analyze millions of tweets and tease out a finding from them. The problem with these reports is that they do not start with any hypothesis; they find the very hypothesis they claim to be true by looking at large volumes of data. Just because we have Big Data, it does not mean we can suspend application of mind.

In the case of Twitter analysis, only those with API skills had access to data. And they applied their API skills well to collect every possible tweet and make predictions that are nothing more than statistical anomalies. Given millions of tweets, anything that looks interesting will catch the eye of a determined programmer seeking to sensationalize his findings.

I believe in random sampling. I believe in reasonable sample sizes, not a whale of a sample size. I believe that abundance of data does not obviate the need for theory or mind. I am not alone here. I posit that any relevant and actionable insight can only come from starting with a relevant hypothesis based on prior knowledge and then using random sampling to test the hypothesis. You do not need millions of tweets for hypothesis testing!

To make it easier for those seeking Twitter data to test their hypotheses, I am making available a really simple script to select tweets at random. You can find the code at GitHub. You can easily change it to do any type of randomization and search queries. For instance, if you want to select random tweets that mention Justin Bieber, you can do that.
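For readers who want to see the idea without the API plumbing, here is a minimal sketch of drawing a uniform random sample from a stream of unknown length, using reservoir sampling. It is not the script on GitHub and makes no Twitter API calls; the generator of strings simply stands in for whatever source of tweets you have, and the function name is hypothetical.

```python
import random

def sample_stream(items, k, seed=None):
    """Reservoir sampling: return k items chosen uniformly at random
    from an iterable whose length is not known in advance."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir

# Stand-in for a stream of tweets (hypothetical data, not real tweets).
tweets = (f"tweet {n}" for n in range(10_000))
sample = sample_stream(tweets, k=385, seed=42)
print(len(sample))
```

The sample size stays fixed no matter how many tweets flow past, which is the point: relevance, not abundance.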

The script has bugs? It likely does. Surely others can pitch in to fix it.

Relevance not Abundance!

Small samples and tests for statistical significance, rather than all the data and statistical anomalies.

Summary: Great business model innovation that points to the future of unbundled pricing. But is Google Consumer Surveys an effective marketing research tool? Do not cancel your SurveyGizmo subscription yet.

Google’s new service, Consumer Surveys, is truly a business model innovation. It unlocks value by creating a three sided market:

1. Content creators who want to monetize their content in an unbundled fashion (charge per article, charge per access etc.)
2. Readers who want access to paid content without having to subscribe to the entire content or muddle through micro-payments (pay per access)
3. Brands seeking customer insights, willing to pay for them but so far unable to find a reliable or cheaper way to get them

When readers want to access premium content they can get it by answering a question posed by one of the brands instead of paying for access. Brands create surveys using Google Consumer Surveys and pay per response.

Google charges brands 10 cents per response, pays 5 cents to the content creators and keeps the rest for enabling this three sided market.

A business model is nothing but value creation and value capture. Business model innovation means innovation in value creation, capture or both. By adding a third side with its own value creation and capture, Google has created an innovative three way exchange to orchestrate the business model.
This also addresses the problem with unbundled pricing, mostly operational challenges with micro-payments and metering.

But I cannot help but notice severe sloppiness in their product and messaging.

Sample Size recommendation: Google recommends that brands sign up for 1500 responses. Their reason: “recommended for statistical significance”.
Statistical significance has no meaning for surveys unless you are doing hypothesis testing. When brands are trying to find out which diaper bag feature is important, they are not doing hypothesis testing.

What they likely mean is confidence interval (or margin of error at a certain confidence level). What is the margin of error at the 95% confidence level? With 1500 samples, assuming 200 million as the population size, it is 2.5%. But you do not need that much precision, given you already have sampling bias by opting for Google Consumer Surveys. Most would do well with just a 5% margin of error, which requires only 385 responses, or 10%, which requires only 97 responses.
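These numbers can be checked with the standard margin-of-error formula for a proportion. The sketch assumes the worst-case proportion of 0.5, a z-value of 1.96 for 95% confidence, and a population large enough (200 million here) that the finite population correction is negligible; the function names are mine.

```python
import math

Z_95 = 1.96  # z-value for a 95% confidence level

def margin_of_error(n, p=0.5, z=Z_95):
    """Margin of error for a proportion estimated from n responses."""
    return z * math.sqrt(p * (1 - p) / n)

def required_sample_size(moe, p=0.5, z=Z_95):
    """Smallest number of responses that achieves the given margin of error."""
    return math.ceil(z ** 2 * p * (1 - p) / moe ** 2)

print(f"1500 responses -> margin of error {margin_of_error(1500):.1%}")
print(f"5% margin of error -> {required_sample_size(0.05)} responses")
print(f"10% margin of error -> {required_sample_size(0.10)} responses")
```

At 10 cents per response, the gap between 1500 and 385 responses is the gap between \$150 and \$38.50 per question.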

Recommending 1500 responses is at best a deliberate pricing anchor, at worst an error.

If they really mean hypothesis testing, one can use a survey tool for that, but it is not coming through in the rest of their messaging which is all about response collection. The 1500 responses suggestion is still questionable. For most statistical hypothesis testing 385 samples are enough (Rethinking Data Analysis published in the International Journal of Marketing Research, Vol 52, Issue 1).

Survey of one question at a time: Brands can create surveys that have multiple questions in them but respondents will only see one question at any given time.

With Google Consumer Surveys, you can run multi-question surveys by asking people one question at a time. This results in higher response rates (~40% compared with an industry standard of 0.1 – 2%) and more accurate answers.
It is not a fair comparison regarding response rate. Besides, we cannot ignore the fact that the response may be just a mindless mouse click by a reader anxious to get to their article. For the same reason they cannot claim “more accurate”.

Do not cancel your SurveyGizmo subscription yet. There is a reason why marketing researchers carefully craft a multiple question survey. They want to get responses on a per user basis, run factor analysis, segment the data using cluster analysis or run some regression analysis between survey variables.

The system will automatically look for correlations between questions and pull out hypotheses.

I am willing to believe there is a way for them to “collate” (not correlate, as they say) the responses to multiple questions of the same survey by each user and present them as one unified response set. If you can string together responses to multiple questions on a per user basis, you can do all the statistical analysis I mentioned above.

But I do not get what they mean by “look for correlations between questions” and definitely don’t get “pull out hypotheses”. It is us, the decision makers, who make the hypothesis in hypothesis testing. We are paid to make better hypotheses that are worthy of testing.

If we accept the phrase, “pull out hypotheses”, to be true then it really means we need yet another data collection process (from a completely different source) to test the hypotheses they pulled out for us. Because you cannot use the very data you used to form a hypothesis to test it as well.
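When a completely different source is not available, a weaker substitute is to split the respondents at random before looking at anything: mine one half for hypotheses and touch the other half only to test them. This is my illustration, not something Google describes; the respondent ids and function name are hypothetical.

```python
import random

def split_respondents(respondent_ids, explore_fraction=0.5, seed=0):
    """Randomly split respondents into an exploration set (for forming
    hypotheses) and a confirmation set (for testing them)."""
    ids = list(respondent_ids)
    random.Random(seed).shuffle(ids)
    cut = int(len(ids) * explore_fraction)
    return ids[:cut], ids[cut:]

# Hypothetical survey with 1500 respondents, identified by 0..1499.
explore_ids, confirm_ids = split_respondents(range(1500))
print(len(explore_ids), len(confirm_ids))
```

The split only helps if it is made once, up front, and the confirmation half stays unseen until the hypotheses are written down.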

Net-Net, an elegant business model innovation with severe execution errors.

## If you cared to run the numbers – Looking beyond the beauty of Infographics

I debated whether or not to write this article. There is really no point in writing articles that point out flaws in some popular piece. Neither the authors of those posts nor the audience care. For those who care, they already understand the math and this  article adds no incremental value.

But the case in point is so egregious that it serves as a poster boy for the need for running the numbers, to test BIG claims for their veracity, and look beyond the glossy eye candies.

This one comes from VentureBeat and has a very catchy title that made 2125 people Like it on Facebook. All of them likely just read the title and were satisfied with it, or saw the colorful infographic and believed the claim without bothering to check for themselves. There is also the comfort in knowing that they are not alone in the Likes.

You can’t expect the general population to do some critical thinking or any analysis given the general lack of statistical skills and their cognitive laziness. It is the System-1 at work with a lazy System-2 (surely you bought Kahneman’s new book).

You would think the author of the article should have checked, but the poor fellow is likely a designer who can do eye-popping  infographics and cannot run tests for statistical significance. He is likely an expert in stating whether using rounded corners with certain shading is better #UX or not.

The catchy title and the subject also don’t help.

So almost everyone accepts the claim for what it is. But is there one bit of truth in VentureBeat’s claim?

Let us run the numbers here.

Without further ado, here is the title of the article that 2125 people on Facebook Liked.

## Women who play online games have more sex (Infographic)

How did they arrive at the claim? They looked at data collected by Harris Interactive, which surveyed over 2000 adults across the US. Since the survey found that 57% of female gamers reported having sex vs. 52% of female non-gamers, the article makes the bold claim in its title. Here is a picture to support the claim.

The claim supported by the beautiful picture sounds plausible?

How would you verify whether the difference is not statistical noise?

You would run a simple crosstab (chi-square test), and there are online tools that make this step easier. What does this mean? You will test whether the difference between the number of female gamers who reported having sex and the number of female non-gamers who reported the same is statistically significant.

The first step is to work with absolute numbers, not percentages. We need the numbers that 57% and 52% correspond to. For this we need the number of females surveyed and what percentage of them are gamers and non-gamers.

The VentureBeat infographic says, “over 2000 adults surveyed”. The exact number happens to be 2132.

Let us find the number of gamers among females. The article says that, of the gamers, 55% are female and 45% are male. This is not the same as saying 55% of females are gamers. Interestingly, they never reveal what percentage of the surveyed people are gamers, so we resort to data from other sources. One such source (circa 2008) says 42% of the population play games online. We can assume that the number is now 50%.

So the number of gamers and non-gamers is 1066 each. Then we can say (using data from the infographic)

Number of female gamers = 55% of 1066 = 587
Number of female non-gamers = ?? (it is not 1066-587)

The survey does not say the number of males vs. females, but we can assume it is split roughly evenly. (If you want to be exact you can use the ratio from census.gov, which states 50.9% female to 49.1% male.) Taking females as roughly 51% of the 2132 surveyed, there are likely about 1089 females surveyed.

That makes number of female non-gamers = 1089 – 587 = 502

The next step is to find the number of women who reported having sex (easy to do from their graph).

Number of female gamers who reported having sex = 57% of 587 = 335 (not having sex = 587 – 335 = 252)

Number of female non-gamers who reported having sex = 52% of 502 = 261 (not having sex = 502 – 261 = 241)

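Now you are ready to build the 2X2 contingency table:

| | Reported having sex | Did not | Total |
| --- | --- | --- | --- |
| Female gamers | 335 | 252 | 587 |
| Female non-gamers | 261 | 241 | 502 |
| Total | 596 | 493 | 1089 |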

Then you run the chi-square test to see if the difference between the numbers is statistically significant.

H0 (null hypothesis): The difference is just random

H1 (alternative hypothesis): The difference is not just random, more female gamers do have sex than female non-gamers.

You use the online tool and it does the work for you.

What do we see from the results? The calculated chi-square value is 2.82. For the difference to be statistically significant at the 0.05 level (95% confidence), the value has to be at least 3.84 (degrees of freedom = 1).

Since that is not the case here, we see no reason to reject the null hypothesis that the difference is just random.
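If you would rather not trust an online tool, the same calculation fits in a few lines. The sketch below uses the counts exactly as derived above (335/252 and 261/241) and the plain Pearson chi-square without a continuity correction, which is an assumption on my part about what the online tool computes. For one degree of freedom the p-value can be obtained from the normal distribution, so no statistics library is needed.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) for the 2x2 table
    [[a, b], [c, d]]. Returns (chi2, p_value) with 1 degree of freedom."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For df = 1, P(chi-square > x) equals P(|Z| > sqrt(x)) for standard normal Z.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Rows: female gamers, female non-gamers. Columns: reported sex, did not.
chi2, p_value = chi_square_2x2(335, 252, 261, 241)
CRITICAL_VALUE = 3.84  # chi-square cutoff for the 0.05 level, df = 1

print(f"chi-square = {chi2:.2f}, p-value = {p_value:.3f}")
print("significant" if chi2 >= CRITICAL_VALUE else "not significant")
```

Because several inputs are assumptions (the 50% gamer share, the gender split), it is worth re-running this with other plausible values to see whether the conclusion holds.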

You can repeat this for their next chart, which shows the share who have sex at least once a week, and you will again find no reason to reject the null hypothesis.

So the BIG claim made by VentureBeat’s article and its colorful infographic is not supported by the very data it cites.

If you followed this far you can see that it is not easy to seek the right data and run the analysis. Most importantly it is not easy to question such claims from a popular blog. So we tend to yield to the claim, accept it, Like it, tweet it, etc.

Now that you have learned to question such claims and credibly analyze them, go apply what you learned to every big claim you read.

## The Hidden Hypotheses We Take For Granted

In A/B testing, you control for many factors and test only one hypothesis, be it two different calls to action or two different colors for BuyNow buttons. When you find a statistically significant difference in conversion rates between the two groups, you declare one version superior to the other.

Hidden in this hypothesis testing are many implicit hypotheses that we treat as truth. If any one of them proves to be untrue then our conclusion from the A/B testing will be wrong.

Dave Rekuc, who runs an eCommerce site, posed a question on Avinash Kaushik’s blog post on tests for statistical significance and A/B testing. Dave’s question surfaces the very issue of one such hidden hypothesis:

I work for an ecommerce site that has a price range of anywhere from \$3 an item to \$299 an item. So, I feel like in some situations only looking at conversion rate is looking at 1 piece of the puzzle.

I’ve often used sales/session or tried to factor in AOV when looking at conversion, but I’ve had a lot of trouble coming up with a statistical method to ensure my tests’ relevance. I can check to see if both conversion and AOV pass a null hypothesis test, but in the case that they both do, I’m back at square one.

Dave’s question is whether the result from the conversion test experiment holds true across all price ranges.

He is correct in stating that looking at conversion rate alone is looking at one part of the puzzle.

When  items vary in price, like he said from \$3 to \$299, the test for statistical significance of difference between conversion rates assumes an implicit hypothesis that is treated as truth.

A1: The difference in conversion rates does not differ across price ranges.

and the null hypothesis (same, just added for completeness)

H0: Any difference between the conversion rates is due to randomness

When your data tells you that H0 can or cannot be rejected, it is conditioned on the implicit assumption A1 being true.

But what if A1 is false? In Dave’s case he uncovered one. What about many other such hypotheses? Other examples include, treating the target population as the same (no male/female difference, no Geo specific difference etc) and products as the same.

In one of my previous posts I pointed out two different results from the same data set, obtained by segmenting and not segmenting the population.

That is the peril of hidden hypotheses.

What is the solution for a situation like Dave’s? Either you explicitly test this assumption first or, as a simpler option, segment your data and test each segment for statistical significance. Since you have a range of price points I recommend you test over 4-5 price ranges.
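Here is a minimal sketch of the segment-and-test option, using a two-proportion z-test per price range. The price bands, visitor counts and conversion counts are all invented for illustration (and only three bands are shown to keep it short); the function name is hypothetical.

```python
import math

def conversion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates
    (two-proportion z-test with a pooled rate)."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / std_err
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical results per price range:
# (conversions A, visitors A, conversions B, visitors B)
segments = {
    "$3-$25": (120, 2000, 165, 2000),
    "$26-$100": (60, 1500, 62, 1500),
    "$101-$299": (30, 1000, 18, 1000),
}

results = {name: conversion_p_value(*counts) for name, counts in segments.items()}

for name, p_value in results.items():
    verdict = "significant" if p_value < 0.05 else "not significant"
    print(f"{name}: p-value = {p_value:.3f} ({verdict})")
```

In this made-up data version B wins clearly among cheap items, makes no difference in the middle, and trends the other way for expensive items, which is exactly what a single pooled test would hide. Keep in mind that testing several segments is itself a case of multiple comparisons, so the per-segment threshold deserves some tightening.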

What is the solution for the bigger problem of many different hidden hypotheses?

Talk to me.