## This regression model is beautiful and correctly used

It appears there is a newfound craze for regression analysis in tech blogs. OK, maybe not; there was just this article and then another in TechCrunch. If you really want to understand and admire a fantastic use of regression analysis, you should read this article by WSJ’s Numbers Guy, Carl Bialik.

Carl set out to find why the San Diego Chargers had yet another bad season in 2011, winning fewer games than their talent and scoring would otherwise suggest:

“What’s frustrating about San Diego’s poor starts in recent seasons isn’t just that the team appears to have had too much talent to lose so many early games. It’s that the Chargers also outscore opponents by too much to explain their relatively poor record.”

Carl’s hypothesis: a team’s winning percentage for the season must be highly correlated with its cumulative (for the season) margin of victory.

Win percentage = Constant + Slope × Cumulative margin of victory

Note that this is not a predictive model about the next game, nor about whether a team will have a winning season. It is simply a model of the expected win percentage given the cumulative scores. No causation whatsoever is implied. Unlike the faulty model we saw, the data already exist and were not coded after the fact.
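To make the equation concrete, here is a minimal sketch of how the constant and slope could be estimated with ordinary least squares. The season data are made-up illustrative numbers, not real NFL results and not Carl's dataset, and the name `fit_line` is hypothetical.

```python
# Hypothetical (made-up) data, NOT real NFL results:
# each pair is (cumulative margin of victory, win percentage) for one team-season.
seasons = [
    (-120, 0.25),
    (-60, 0.375),
    (0, 0.50),
    (60, 0.625),
    (120, 0.75),
]


def fit_line(points):
    """Ordinary least squares fit of y = constant + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    constant = mean_y - slope * mean_x
    return constant, slope


constant, slope = fit_line(seasons)
print(f"Win percentage = {constant:.3f} + {slope:.5f} x margin")
```

The fit only describes how the two numbers moved together in seasons already played; nothing in it says that margin causes wins.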

How would you use the model?

At mid-season you enter a team’s cumulative margin of victory (total points scored less total points against) and find the win percentage suggested by the model. If the actual number is significantly lower than the one suggested by the model, as in the case of the 2011 San Diego Chargers, you look for explanations for the poor record. At first glance, the gap signals wide variance in the team’s performance: when they win, they win big, and when they lose, they lose the close ones. Then you look for reasons and fix them.
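The mid-season check described above can be sketched roughly as follows. The coefficients, the team's record, and the 0.08 cutoff for "significantly lower" are all hypothetical placeholders. In practice the coefficients would presumably need to be fit on data from the same point in the season, since margins accumulate over games.

```python
# Hypothetical coefficients; real ones would come from a fit on past seasons.
CONSTANT = 0.5
SLOPE = 0.002
# Hypothetical cutoff for calling a shortfall "significant".
GAP_THRESHOLD = 0.08


def expected_win_pct(margin, constant=CONSTANT, slope=SLOPE):
    """Win percentage the model suggests for a given cumulative margin."""
    return constant + slope * margin


def shortfall(wins, losses, points_for, points_against):
    """Expected minus actual win percentage (positive = underperforming).

    Ties are ignored in this sketch.
    """
    actual = wins / (wins + losses)
    margin = points_for - points_against
    return expected_win_pct(margin) - actual


# Made-up mid-season record: 4-4 while outscoring opponents by 50 points.
gap = shortfall(wins=4, losses=4, points_for=200, points_against=150)
flagged = gap > GAP_THRESHOLD
print(f"shortfall = {gap:.3f}, worth investigating: {flagged}")
```

A flagged team is only a prompt to go looking for explanations; the model itself does not say what they are.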

This example is by far the best out there of the correct use of regression, which by definition means looking backwards. This model only looks backwards; it does not predict the future based on past events. In doing so it treats regression as what it is, correlation, and hence accounts for all kinds of biases.