## Yet Another Causation Confusion – School Ratings and Home Forclosures

Here is the headline from WSJ,

## One Antidote to Foreclosures: Good Schools

The article is based on  data mining by Location Inc, that found that

over past six months percentage of foreclosure (or “real-estate-owned”) sales went down as the school ranking went up in five metro areas

Next we see a news story attributing clear causation. This one is hard to notice for some as it appears to be longitudinal – reported over six month period. If it were a simpler cross-sectional study most will catch the fallacy right away.

First let me point out this is the problem with data mining – digging for nuggets in mountains of Big Data without an initial hypothesis and finding such causations.

Second school rating improvement could be due to random factors that coincide with lower foreclosure.

Third despite the fact that longitudinal aspect implies causation there are many omitted variables here – the common factors that are driving down foreclosure and driving up school rating.

School rating is not an independent metric. It relies not just on teacher performance  but also on parents. The same people who are willing to work with their kids are also likely to be fiscally responsible. Another controversial but a proven factor is the effect of genes on children’s performance.

Ignoring all these if we focus our resources on improving school ratings to solve foreclosure crisis, we will be chasing away the wolves that cause eclipses with loud noises.

## A Closer Look at A/B testing

Suppose you have a service – be it a web based service or a brick and mortar service.  Visitors walk through the front door. Most just leave without taking an action that is favorable to you. Some do and become converts.

As you  function along, you form a belief/gut-feel/hypothesis  that color of the door affects how many will convert.  Specifically, certain color will improve conversion. (I am color blind, else I will call out the colors I used below)

You verify this hypothesis by running a split test. You evenly split your visitor stream, randomly sending them through Door A of current color or Door B of the new color which is supposed to increase conversion. This is the A/B split test.

How do you verify your hypothesis?

The most common way that is practiced by every A/B test tool in the market is shown below

These tools keep  both Converts and Non-Converts for a given Door together and treats each as a separate population.  Those who went through Door A  (both Converts and Non-Converts) are kept separate from those who went through Door B.  They test the hypothesis that the proportion of converts in the Door B population is higher than proportion of converts in the Door A population. The tools assume that the population data are normally distributed and use a 2-sample t-test to verify the  difference between the two proportions is statistically significant.

What is wrong with this approach? For starters, you can see how it rewrites the hypothesis and re-wires the model. This approach treats conversion as an attribute of the visitor. This is using the t-test for the wrong purpose or using the wrong statistical test for A/B testing.

For example, if you want to test whether there is higher prevalence of heart disease among Indians living in US vs. India, you will draw random samples from the two populations and, measure the proportion of heart disease in each sample and do a t-test to see if the difference is statistically significant. That is a valid use of t-test for population proportions.

Conversion isn’t same as measuring proportion of population characteristic like heart disease. Treating the conversion rate as a characteristic of the visitor is contrived. You also need to keep the Converts and Non-Converts together while you only need to look at those who converted.

Is there another way?

Yes. Take a look at this model that closely aligns with the normal flow. We really do not care about the Non-Converts and we test the correct hypothesis that more Converts came through Door B than through Door A.

This method grabs a random sample of Converts and tests whether there are more that came through Door B than through Door A. It uses Chi-square test to verify that the difference is not just due to randomness. No other assumptions needed like assuming normal distribution and it tests the right hypothesis. Most importantly it fits the flow and model before we introduced Door B.

Want to know more? Want to know the implications of this and how you can influence your A/B test tool vendors to change?  Drop me a note.

## Significance of Random Flukes

One sure way to kill a joke is to explain it. I hate to kill this great and clever joke on statistical significance, but here it goes. May be you want to just read the joke, love it, treasure it and move on without reading rest of the article.

Love it? I love this one for its simple elegance. You can leave now if you do not want to see this dissected.

First the good things.

The “scientists” start with hypotheses external to the data source, collect data and test for statistical significance. They likely used 1-tailed t-test and ran a between groups experiment.

One group was control group that did not eat the jelly beans. Other group was the treatment group that was treated with jelly beans.

The null hypothesis H0 is, “ Any observed difference between the number of occurrences of acne between the two groups is just due to coincidence”.

The alternative hypothesis H1 is, “The differences are statistically significant. The jelly beans made a difference”.

They use p value of 0.05 (95% confidence). p-value of 0.05 means there is only 5% probability the observed result can be entirely due to chance. If the computed p-value is less than 0.05 (p<0.05), they reject H0 and accept H1. If the computed p-value is greater than 0.05  (p>0.05) H0 cannot be rejected, it is all random.

They run a total of 21 experiments.

The first is the aggregate. They likely used a large jar of mixed color jelly beans and ran the test and found no compelling evidence to overthrow the null hypothesis that it was just coincidence.

Then they run 20 more experiments, one for each color. They find that in 19 of the experiments (with 19 different colors) they cannot rule out coincidence. But in one experiment using green jelly beans they find p less than 0.05. They  reject H0 and accept H1 that green jelly beans made a difference in the number of occurrences of acne.

In 20 out of 21 experiments (95.23%), the results were not significant to toss out coincidence as the reason. In 1  out of 21 experiments (4.77% ) it was and hence green was linked to acne.

In other words, there was 95.23% probability (p=0.9523) that any observed link between jelly bean and acne is just random.

However the media reports, “Green jelly beans linked to acne at 95% confidence level”, because that experiment found p<0.05.

Green color is the spurious variable. The fact that the Green experiment had p<0.05 could easily be because this experiment run happened to have high concentration of random flukes in it.

The very definition of statistical significance testing using random sampling is just that.

If we have not seen the first experiment  or the 19 other experiments that had p>0.05, we would be tempted to accept the link between green jelly beans and acne. Since we saw all the negative results, we know better.

In reality, we don’t see most if not all of the negative findings. Only the positive results get written up – be it the results of an A/B test that magically increased conversion or scientific research.
After all it is not interesting to read how changing just one word did not have a effect on conversion rates. Such negative findings deserve their place in the round filing cabinet.

By rejecting all negative findings and choosing only postive findings, the experimenters violate the rule of random sampling and  highlight the high concentration of random flukes as breakthrough.

The next step in this slippery slope of pseudo statistical testing is Data Dredging. Here one skips the need for initial hypotheses and simply launches into data for finding “interesting correlations”.
Data Dredging is slicing up data in every possible dimension to find something – anything.

For example, “Eating Green jelly beans with left hand while standing up on Tuesdays” causes acne.

If you find this claim is so ridiculous that you will not fall for it, consider all the articles you have seen about the best days to run email marketing  and best days to tweet OR How to do marketing like  certain brands.

Can you spot the fact that these are based on Data Dredging?

(See here for great article on Data Dredging).

In this age of instant publication, easy experimentation, Big Data and social media echo chamber, how can you spot and stay clear of Random Flukes reported as scientific research?

1. What are the initial hypotheses before collecting data? If there is none, thanks but no thanks. (Data Dredging)
2. How are these hypotheses arrived at? If these were derived from the very data they are tested with, keep moving. A great example of this is the class of books, “7 Habits of …”
3. See a study from extremely large samples? Many of the random flukes that do not show up in small samples do get magnified in large samples. It is just due to the mathematical artifact. Again, thanks but no thanks.
4. Very narrow finding? It is the Green jelly bean again ask about other dimensions.

Or you can just plain ignore all these nonsensical findings camouflaged in analytics.

## Who Makes the Hypothesis in a Hypothesis Testing?

Most of my  works on pricing and consumer behavior studies rely on hypothesis testing.  Be it finding difference in means between two groups, non-parametric test or making a causation claim, explicitly or implicitly I apply hypothesis testing. I make overarching claims about customer willingness to pay and what factors influence it based on hypothesis testing. The same is true for the most popular topic, these days, for anyone with a web page – AB split testing. Nothing wrong with these methods and I bet I will continue to use these methods in all my other works.

We should note however  that the use of hypothesis and finding statistically significant difference should not blind us to the fact that there is some amount of subjectivity that go into all these. Another important distinction to note is, despite the name hypothesis testing we are not testing whether the hypothesis is validated but whether the data fits the hypothesis which we take it as given. More on this below.

All these testings proceed as follows:

1. Start with the hypothesis. In fact you always start with two, the null hypothesis which is the same for any statistical testing
The Null hypothesis H0: The observed difference between subjects (or groups) is just due to randomness.
Then you write down the hypothesis that you want to make a call on.
Alternative hypothesis H1: The observed difference between subjects (or groups) is indeed due to one or more treatment factors that you control for.
2. Pick the statistical test you want to use among those available given your case. Be it a non-parametric test like Chi-square  that makes no assumption about the distribution of data (AB testing) or parametric test like t-test that assumes Gaussian distribution (e.g., normal) of data.
3. Select a critical value or confidence level for the test 90%,95%, 99% with 95% being the most common. This is completely subjective. What you are stating with the critical value is the results are statistically significant only if these can be caused due to randomness in less than 5% (100-95%) of the cases. The critical value is also expressed as p value ( probability ), in this case 0.05.
4. Perform the test with random sampling. This needs more explanation but is beyond the scope of what I want to cover here.

As you can see, we the analyst/decision maker make up the hypothesis and we are treating the hypothesis as given.  We did the right thing of writing it first. ( A common mistake in many of the AB tests and in data mining exercises is writing the hypothesis after the test.)

What we are testing is, given this hypothesis H1 is true  (P(H1)=1) what is the probability the test data D fits the hypothesis.

This is expressed as P(D|H1).  Statistical significance here means P(D|H1) > 0.95 given P(H1) =1.

When we say we accept H1, we are really saying H0 (randomness) cannot be the reason and hence H1 must be true. We rule out the fact that the observed data can be explained by any number of alternative hypotheses. Since we wrote the original hypothesis, if we did not base it on proper qualitative analysis then we could be wrong despite the fact  our tests yields statistically significant results.

This is why you should never launch a survey without doing focus groups and customer interviews. This is why you don’t jump into statistical testing before understanding enough about the subjects under study to frame relevant hypothesis.  Otherwise you are, as some wrote to me, using gut feel or pulling things out of thin air and accepting it simply because there is not enough evidence in the data to overturn the null hypothesis.

How do you come up with your hypotheses?

Look for my next article on how this is different in Bayesian statistics.

## The Point Of Experimentation

There is a PBS Kids program called Dinosaur Train. In one of the episodes, the cartoon characters talk about hypothesis and experiments. The kids are told, “a hypothesis is an idea you can test and you test by doing experiments”.

If there is no hypothesis, why do you need to experiment? If you are not going to do anything different based on the experiment, do you need it?

On the other hand, if there is an hypothesis do you really need an experiment to verify it or is there already data that you can use?

If there is no new learning, do you need that experiment? Is the cost worth it?

Is there new education in the “new” kick of the mule?