This regression model is beautiful and correctly used

Yes, the X and Y are switched. But think about it: it is just convention.

It appears there is a newfound craze for regression analysis in tech blogs. OK, maybe not; there was just this article and then another in TechCrunch. If you really want to understand and admire a fantastic use of regression analysis, you should read this article by the WSJ’s Numbers Guy, Carl Bialik.

Carl set out to find why the San Diego Chargers had yet another bad 2011 season, winning fewer games than their talent and scoring would otherwise suggest:

“What’s frustrating about San Diego’s poor starts in recent seasons isn’t just that the team appears to have had too much talent to lose so many early games. It’s that the Chargers also outscore opponents by too much to explain their relatively poor record.”

Carl’s hypothesis: a team’s winning percentage for the season must be highly correlated with its cumulative margin of victory for the season.

Win percentage = Constant + Slope × Cumulative margin of victory
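Here is a minimal sketch of fitting that line in Python; the margins and win percentages are entirely made up for illustration, and numpy’s polyfit stands in for a full regression package:

```python
# Fit: Win percentage = Constant + Slope * Cumulative margin of victory
# The data below are hypothetical, one entry per team-season.
import numpy as np

margin = np.array([-120, -60, -15, 0, 40, 85, 130])             # cumulative margin of victory
win_pct = np.array([0.25, 0.31, 0.44, 0.50, 0.56, 0.69, 0.81])  # season win percentage

slope, constant = np.polyfit(margin, win_pct, 1)  # least-squares line
print(f"Win percentage = {constant:.3f} + {slope:.5f} * margin")

# Expected win percentage for a team outscoring opponents by 50 points
print(f"Expected win pct at +50: {constant + slope * 50:.3f}")
```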

Note that this is not a predictive model about the next game, nor about whether a team will have a winning season. It is simply a model of expected win percentage given the cumulative scores. No causation whatsoever is implied. Unlike the faulty model we saw, the data already exist and were not coded after the fact.

How would you use the model?

At mid-season you enter a team’s cumulative margin of victory (total points scored less total points against) and find the win percentage suggested by the model. If the actual number is significantly lower than the one suggested by the model, as in the case of the 2011 San Diego Chargers, you look for explanations for the poor record. At the outset it signals wide variance in the team’s performance: when they win, they win big, and when they lose, they lose the close ones. Then you look for reasons and fix them.

This example is by far the best out there in the correct usage of regression, which by definition means looking backwards. This model only looks backwards; it does not predict the future based on past events. And in doing so it treats regression for what it is, correlation, and hence accounts for all kinds of biases.

 

What are the chances mom will be home when we arrive and what does this have to do with Pinterest revenue?

Update: This article will help you understand my Gigaom guest post on Pinterest revenue: How much does Pinterest actually make?

One of the games my 7-year-old and I play while driving home from her school is guessing whether mom will be home when we arrive.

I ask, “What are the chances mom will be home when we arrive?”
She almost always replies, “50-50.”
Not bad for someone just learning to enumerate the possibilities and find the fraction. When we arrive home, mom is either there or not. So 50-50 seems reasonable.

But are the chances really 50-50? If not, how would we find them?

Well, let us start with some safe, feel-good assumptions: my drive time is constant, there is a mom, she always leaves at a fixed time, and she will arrive.
Other than that we need to know:

  1. What time is it now?
  2. What is her mean drive time?
  3. What is the standard deviation of drive time?

Assume that the drive times are normally distributed with the stated mean and standard deviation. It is then a question of finding in what percentage of the scenarios the drive times result in an earlier arrival. That is the probability we were looking for, and it is not 50-50 simply because there are only two outcomes.
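Here is a sketch of that calculation with scipy, using made-up numbers for mom’s drive time and for how long she can take and still beat us home:

```python
# Mom's drive time ~ Normal(mean, sd); she is home first whenever her
# drive takes less than the threshold implied by the fixed leave times.
from scipy.stats import norm

mean_drive = 35.0   # hypothetical mean drive time, minutes
sd_drive = 8.0      # hypothetical standard deviation, minutes
threshold = 40.0    # minutes she can take and still arrive before us

p_home = norm.cdf(threshold, loc=mean_drive, scale=sd_drive)
print(f"Chance mom is home when we arrive: {p_home:.0%}")  # not 50-50
```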

Here we built a very simple model. But who knows what the mean is, let alone the standard deviation? We do not. So we do the next best thing: we estimate. We do not literally estimate the mean and standard deviation; we estimate a high and a low value such that in 90% of the cases the drive time falls in that range. Stated another way, there is only a 10% chance the drive time is outside this range.

This is the 90% confidence interval. We are 90% confident the value is in this interval. Once we have this, it is simple math to find the mean and standard deviation.

Mean = average of the low and high values, (low + high) / 2.
Standard deviation = (high - low) / 3.29, since 90% of a standard normal distribution falls within ±1.645σ, a span of 3.29σ.

Once you have the mean and standard deviation, you can do the math to find the percentage of scenarios where the drive time is below a certain value.
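In code, with a hypothetical 90% confidence interval of 25 to 50 minutes:

```python
# From a 90% confidence interval estimate to mean and standard deviation.
from scipy.stats import norm

low, high = 25.0, 50.0       # "90% of the time the drive takes 25-50 minutes"
mean = (low + high) / 2      # average of the low and high values
sigma = (high - low) / 3.29  # a 90% interval spans 3.29 standard deviations

print(f"mean = {mean:.1f} min, sigma = {sigma:.2f} min")
print(f"P(drive < 40 min) = {norm.cdf(40, loc=mean, scale=sigma):.0%}")
```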

This is still simple. We treated drive time as the measurable quantity and worked with it. But drive time is likely made up of many different components, each a random variable of its own: for instance, time to get out of the parking lot, time to get onto the highway, etc. There is also the possibility that the start time is no longer fixed and varies.

(If you want to build a more realistic model you should also model my drive time as a random variable with its own 90% confidence interval estimate. But let us not do that today.)

In such a case, instead of estimating the whole, we estimate 90% confidence intervals for each of these parts. In fact this is a better and preferred approach, since we are estimating smaller quantities for which we can make tighter estimates than we can for the total drive time.

How do we go from the 90% confidence interval estimates of these component variables to an estimate of the total drive time? We run a Monte Carlo simulation to build the distribution of the drive time variable from its component variables.

This is like imagining driving home 10,000 times. For each iteration, randomly pick a value for each one of the component variables based on its normal distribution (mean and sigma) and add them up:

drive time (iteration n) = exit time (n) + getting to highway time (n) + …

Once you have these 10,000 drive times, find what percentage of the scenarios have a drive time less than a certain value. That is the probability we were looking for.
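A sketch of that simulation in Python, with hypothetical 90% confidence intervals for each component of the trip (in minutes):

```python
# Monte Carlo: imagine the drive home 10,000 times by summing random
# draws for each component, then count the favorable scenarios.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical (low, high) 90% confidence interval for each component
components = {
    "exit parking lot": (2, 8),
    "get to highway": (5, 15),
    "highway stretch": (15, 30),
}

drive_times = np.zeros(n)
for low, high in components.values():
    mean = (low + high) / 2
    sigma = (high - low) / 3.29
    drive_times += rng.normal(mean, sigma, n)  # one draw per imagined trip

threshold = 40.0  # minutes she can take and still beat us home
print(f"P(mom home first) = {(drive_times < threshold).mean():.0%}")
```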

From this we could say, “there is 63% chance mom will be home when we arrive”.

We could also say, “there is only 5% chance mom will arrive 30 minutes after we arrive”.

When we know there is roadwork starting on a segment, we can add another delay component (based on its 90% confidence interval) and rerun the simulation.

That is the power of statistical modeling: we can estimate any unknown based on our initial estimates and state our level of confidence in the final answer.

Now what does this have to do with Pinterest revenue?

Read my article in Gigaom

How confident are you of your estimate?

This morning I was listening to an NPR story on the new estimate of the oil spill. The reporter was asking a professor from Purdue about his now-famous estimate that the oil spill is 70,000 barrels a day, way more than the government estimate of 5,000 barrels per day. Here is a snippet of the conversation:

Reporter: What is your estimate?
Professor: 70,000 barrels per day
Reporter: What is the margin of error?
Professor: About 20%
Reporter: That means the spill is anywhere between 56,000 and 84,000 barrels a day
Professor: That’s right.
Reporter: How confident are you of the estimate?
Professor: Pretty confident

Pretty confident? I was disappointed. I was not sure if the professor deliberately chose the generic language of “pretty confident” over what he would normally use with his estimates: a confidence level, such as 95%.

Be it estimating an oil spill, a sales increase, marketing conversion, or a person’s height, the confidence level (90%, 95%, or 99%) and the margin of error are two very useful and relevant qualifiers you must provide (assuming the estimates are normally distributed about the real parameter). I particularly prefer the 95% confidence level because its margin of error is almost equal to 2 sigmas (standard deviations).

Suppose the professor was 95% confident. He is then saying the chance that the oil spill is lower than 56,000 barrels a day is less than 2.5%. So according to him, the chance that the oil spill is really 5,000 barrels a day is 0.0000000000000000000083 (about 8.3 × 10⁻²¹), a 9-sigma event!
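Checking that arithmetic in scipy, treating the 95% margin of error as 2 sigmas as above:

```python
# A 20% margin of error on 70,000 barrels is 14,000 barrels ~ 2 sigmas,
# so one sigma is about 7,000 barrels. How unlikely is 5,000 barrels?
from scipy.stats import norm

estimate = 70_000
sigma = 0.20 * estimate / 2         # ~7,000 barrels per day
z = (5_000 - estimate) / sigma      # about -9.3 sigmas
print(f"z = {z:.2f}, P = {norm.cdf(z):.1e}")  # about 8e-21, as above
```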