How do VCs decide to invest in a startup – Regression Analysis, Part 1

This is a multi-part article. I decided to make it a lot more technical (as in statistical analysis), so that it serves not only as a prediction model for startup funding but also as a template for evaluating similar predictions when you come across them. You will not see the results of the regression analysis in this article, but you can read it all here if you can answer a statistics question.

Imagine you were asked to invest in ten startups. Given numerical ratings on the Team, Product, Market and Traction but knowing nothing about the specifics of the team, the exact product or the domain they play in, can you pick those that actually received a term sheet? Take this quiz and see how you do.

What characteristics of a startup make it attractive for venture capitalists to invest in? If you are a startup founder preparing for that pitch, wouldn’t it be nice to know the answer, so you can prepare well and maximize your chances of getting that coveted term sheet? For those who are listening, there is no scarcity of advice. Everyone has something to say: VCs, startup founders who secured funding at significant valuations, and others on the sidelines.

Is any of this advice relevant to startup founders? What is noise and what is signal? Does any of it have hard numbers behind it?

Until now, there was no hard quantitative data on startups that pitch to VCs and the outcomes of those pitches. Thanks to Jay Jamison, partner at BlueRun Ventures, I have data on 216 startups that pitched to his firm. Jamison rated them on four metrics (Team, Product, Market and Traction) using a 5-point scale and also noted the outcome of each pitch. The outcome is rated as the likelihood of getting a term sheet on a five-point scale, with 5 meaning they got it.

Armed with this data, we can now use statistical analysis to model whether any of these traits influences a startup’s ability to get a term sheet. While Jamison did an initial analysis himself, it was not rigorous enough and pointed to incorrect reasons. He later shared his data with me, which enabled me to analyze it in not one but two ways to come up with a prediction model.

The results indeed hold surprises compared to his previous analysis. In the next part I will go into the details of regression analysis, its metrics and its pitfalls.

Again, see if you can predict which startups got funding: take this quiz.

 

This regression model is beautiful and correctly used

Yes, the X and Y are switched. But think about it, it is just convention

There appears to be a newfound craze for regression analysis in tech blogs. OK, maybe not; there was just this article and then another in TechCrunch. If you really want to understand and admire a fantastic use of regression analysis, you should read this article by WSJ’s Numbers Guy, Carl Bialik.

Carl set out to find why the San Diego Chargers had yet another bad season in 2011, winning fewer games than their talent and scoring would otherwise suggest:

“What’s frustrating about San Diego’s poor starts in recent seasons isn’t just that the team appears to have had too much talent to lose so many early games. It’s that the Chargers also outscore opponents by too much to explain their relatively poor record.”

Carl’s hypothesis: a team’s winning percentage for the season must be highly correlated with its cumulative (for the season) margin of victory.

Win percentage = Constant + Slope × Cumulative margin of victory

Note that this is not a predictive model of the next game, nor of whether a team will have a winning season. It is simply a model of the expected win percentage given the cumulative scores. No causation whatsoever is implied. Unlike the faulty model we saw, the data already existed and was not coded after the fact.
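To make the equation concrete, here is a minimal sketch of how the Constant and Slope could be estimated with ordinary least squares. The function name fit_line and the five data points are my own assumptions for illustration; the numbers are made up (and chosen to lie exactly on a line so the result is easy to check), not Bialik’s data or real NFL results.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = constant + slope * x."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    constant = mean_y - slope * mean_x
    return constant, slope


# Made-up season totals, for illustration only (not real team data):
# cumulative margin of victory in points, and win percentage as a fraction.
margins = [-120, -60, 0, 60, 120]
win_pcts = [0.25, 0.375, 0.5, 0.625, 0.75]

constant, slope = fit_line(margins, win_pcts)
print(f"Win percentage = {constant:.3f} + {slope:.5f} x margin")
```

With real data the points would scatter around the line, and it is that scatter that tells you how much a team’s record can deviate from the fit before it deserves a second look.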

How would you use the model?

At mid-season you enter a team’s cumulative margin of victory (total points scored less total points against) and find the win percentage suggested by the model. If the actual number is significantly lower than the one suggested by the model, as in the case of the 2011 San Diego Chargers, you look for explanations for the poor record. At the outset it signals wide variance in the team’s performance: when they win, they win big, and when they lose, they lose the close ones. Then you look for reasons and fix them.
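That mid-season check can be sketched in a few lines. Everything numeric here is a placeholder I made up: the coefficients CONSTANT and SLOPE, the team’s numbers, and the THRESHOLD that stands in for "significantly lower". None of them come from Bialik’s article; in practice you would take the coefficients from a fit on real seasons and set the threshold from the spread of the residuals.

```python
def expected_win_pct(margin, constant, slope):
    """Win percentage suggested by the linear model for a given margin."""
    return constant + slope * margin


def record_gap(actual_win_pct, margin, constant, slope):
    """Actual minus expected win percentage; negative means underperforming."""
    return actual_win_pct - expected_win_pct(margin, constant, slope)


# Hypothetical model coefficients (assumed, not taken from the article).
CONSTANT = 0.5
SLOPE = 0.002
# Assumed cutoff for "significantly lower"; a real analysis would base this
# on the residual spread of the fitted model.
THRESHOLD = 0.10

# Hypothetical team at mid-season: outscoring opponents by 50 points in
# total, yet winning only 3 of 8 games.
margin = 50
actual = 3 / 8

gap = record_gap(actual, margin, CONSTANT, SLOPE)
underperforming = gap < -THRESHOLD
print(f"expected {expected_win_pct(margin, CONSTANT, SLOPE):.3f}, "
      f"actual {actual:.3f}, gap {gap:+.3f}, flag: {underperforming}")
```

A flagged team is only a prompt to go looking for reasons, such as close losses paired with blowout wins; the model itself says nothing about why the gap exists.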

This example is by far the best out there of the correct use of regression, which by definition means looking backwards. This model only looks backwards; it does not predict the future based on past events. In doing so it treats regression as what it is, correlation, and hence accounts for all kinds of biases.