## 50/50

I was lying in the dentist's chair while my dentist was busy drilling out a cavity. She told me, “I think I did all I can, but if you feel pain of any kind tomorrow, then it is a root canal”.

I asked her, “What are the chances I need a root canal?”

To this she replied, “50/50”.

I said, “50/50? What kind of odds is that? Can’t you tell now whether or not you hit the nerve, and hence whether or not I need a root canal? Can’t you tell, from all the different cavities you have filled, how likely it is that I will need the painful procedure?”

With a kind expression that only dentists have, she replied, “It doesn’t matter about all the others. Tomorrow when you eat ice cream you either have splitting pain or not. That is all that matters. So it is 50/50 for you”.

The mere fact that either I will be in pain or not (2 states) was her reasoning for 50/50.

She is my dentist and I am not going to speak against her. But I am going to point you to a reporter's prediction of how likely Amazon is to introduce a free Kindle. His answer, as you see above, is 50/50. The fact that there are only two outcomes does not mean they are equally probable. Take a lottery ticket, for instance: the fact that you will either wake up as a lottery winner or not does not mean your chances of winning are 50/50.

50/50? That is not odds at all. That is just someone making things up when they don’t know the answer and/or don’t know how one assigns probabilities to future events. In the Kindle case, whether or not we see a free Kindle, a 50/50 prediction covers both outcomes. The reporter can always say, “I told you so”. Such a prediction is not actionable and does not constitute probabilistic thinking.

So how would one make probabilistic predictions? Let me take you back to the dentist case. Suppose she has seen thousands of people with a condition close to mine and knows the outcome after she filled their cavities. Or she has access to a database of similar cases documented by many other dentists. Then, based on the outcomes (prior knowledge), she could say, “of the 1000 people who had such a cavity filled, 503 did not need a root canal, so your chances are 50/50”.

What does it mean in the case of making business predictions like predicting Free Kindle? Let me take you back to the Pinterest article I wrote and the companion article.

Imagine enumerating all possible future scenarios of Amazon’s device strategy and computing the outcome of each. These scenarios are the equivalent of many different futures based on numerous variables like cost of production, demand, market forces, Amazon’s other opportunities, etc. Then we simply need to look at the percentage of scenarios in which a free Kindle delivers more profit to Amazon than charging for it. Only after such an analysis can we say anything about the chances of a free Kindle.

Since no one has done such a scenario analysis, saying “50/50 chance of Free Kindle next year” is simply pointless.

And by the way, the next time you hear 50/50, it is perfectly okay to assume the speaker has no clue. In 99.9999% of the cases no one has done the scenario analysis needed to make such a probabilistic prediction, and if they have, they likely will not use an expression like 50/50.

## Probability of Winning MegaMillions Vs. Probability of Dating Supermodel

A \$640 million jackpot is hard to resist. Those who do not usually play the lottery are now playing, buying at least five tickets. The WSJ reports MegaMillions sold two tickets for every adult in the country.

The probability of winning is still 1/175,711,536.

MarketPlace Radio tries to put this number in perspective by giving a list of things that are more likely to happen to us.

Chances of dating a supermodel? 1/88,000

But there is a big difference in the definition of probability between these two scenarios that is lost.

Probability of winning MegaMillions is determined by simply counting all possible ticket numbers. This is the frequentist approach and it is correct in this case.
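The counting can be reproduced in a few lines. This is a minimal sketch that assumes the game format in use at the time, five white balls picked from 56 plus one Mega Ball picked from 46; the format is not stated in this post, and `total_tickets` is a name made up for the illustration.

```python
from math import comb

# Assumed game format (not stated in the post):
# 5 white balls out of 56, plus 1 Mega Ball out of 46.
WHITE_BALLS, WHITE_PICKS, MEGA_BALLS = 56, 5, 46

def total_tickets(white=WHITE_BALLS, picks=WHITE_PICKS, mega=MEGA_BALLS):
    """Count distinct tickets: unordered white-ball picks times Mega Ball choices."""
    return comb(white, picks) * mega

# One ticket wins, so the chance of winning is 1 over the number of tickets.
print(f"1 in {total_tickets():,}")
```

Under that assumed format the count matches the 1/175,711,536 figure quoted above.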

We cannot use the same frequentist approach for finding your chances of dating a supermodel. The 1/88,000 number is based on the number of supermodels and the number of men. This is relevant only if we are estimating, “what is the probability that a randomly selected man from the population is dating a supermodel?”

When you want to measure your chances of dating a supermodel you need a different definition: probability as a measure of uncertainty. It is not 1/88,000. (For a more detailed discussion of this definition of probability see here.)

How can you measure this probability? If you imagined living your life 10,000 times, given all possible events that could happen and the many different choices you make, in how many such lives do you find yourself dating the supermodel? That is your probability and it is different for every individual.

On the other hand, if you do win tonight, the probability of dating a supermodel is 1. That is the conditional probability.

## What are the chances mom will be home when we arrive and what does this have to do with Pinterest revenue?

One of the games my 7-year-old and I play while driving home from her school is guessing whether mom will be home when we arrive.

I ask, “What are the chances mom will be home when we arrive?”

She would almost always reply, “50-50”.

Not bad for someone just learning to enumerate the possibilities and find the fraction. When we arrive home, mom is either there or not. So 50-50 seems reasonable.

But are the chances really 50-50? If not, how would we find them?

Well, let us start with some safe and feel-good assumptions: my drive time is constant, there is mom, she always leaves at a fixed time, and she will arrive.
Other than that, we need to know:

1. What time is it now?
2. What is her mean drive time?
3. What is the standard deviation of drive time?

Assume that the drive times are normally distributed with the stated mean and standard deviation. It is then a question of finding in what percentage of the scenarios her drive time gets her home before we arrive. That is the probability we were looking for, and it is not 50-50 simply because there are only two outcomes.
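Here is a minimal sketch of that calculation, under the same assumptions (fixed departure time, normally distributed drive time). The numbers are hypothetical and chosen only for illustration: mom needs 30 minutes on average with a standard deviation of 5 minutes, and we will arrive 32 minutes after she leaves. The function name `prob_mom_home_first` is made up for this example.

```python
from statistics import NormalDist

def prob_mom_home_first(minutes_available, mean_drive, sd_drive):
    """P(mom's drive time <= minutes between her departure and our arrival)."""
    drive_time = NormalDist(mu=mean_drive, sigma=sd_drive)
    return drive_time.cdf(minutes_available)

# Hypothetical inputs, not measurements.
p = prob_mom_home_first(minutes_available=32, mean_drive=30, sd_drive=5)
print(f"Chance mom is home when we arrive: {p:.0%}")
```

Change any of the three inputs and the answer moves, which is the point: the probability depends on what we know, not on the number of outcomes.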

Here we built a very simple model. But who knows what the mean is, let alone the standard deviation? We do not. So we do the next best thing: we estimate. We do not literally estimate the mean and standard deviation; instead we estimate a low and a high value such that in 90% of the cases the drive time falls in that range. Stated another way, there is only a 10% chance the drive time is outside this range.

This is the 90% confidence interval. We are 90% confident the value is in this interval. Once we have this, it is simple math to find the mean and standard deviation.

Mean is the average of the low and high values.
Standard deviation is the difference between high and low divided by 3.29, the number of standard deviations that the central 90% of a standard normal curve spans (±1.645σ).
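The two rules above can be sketched as a small helper. The 20 to 40 minute range is a hypothetical estimate used only for illustration, and `normal_from_90ci` is a made-up name.

```python
from statistics import NormalDist

def normal_from_90ci(low, high):
    """Turn a 90% interval (low, high) into the mean and sd of a normal curve."""
    z = NormalDist().inv_cdf(0.95)   # about 1.645, so the interval spans about 3.29 sd
    mean = (low + high) / 2
    sd = (high - low) / (2 * z)
    return mean, sd

# Hypothetical estimate: "90% of the time the drive takes 20 to 40 minutes".
mean, sd = normal_from_90ci(20, 40)
print(f"mean = {mean:.1f} min, sd = {sd:.2f} min")
```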

Once you have the mean and standard deviation, you can do the math to find the percentage of scenarios where the drive time is below a certain value.

This is still simple. We treated drive time as the measurable quantity and worked with it. But drive time is likely made up of many different components, each a random variable of its own: for instance, time to get out of the parking lot, time to get on the highway, etc. There is also the possibility that the start time is no longer fixed and varies.

(If you want to build more realistic model you should also model my drive time as random variable with its own 90% confidence interval estimate. But let us not do that today.)

In such a case, instead of estimating the whole, we estimate 90% confidence intervals for all these parts. In fact this is the better and preferred approach, since we can make better and tighter estimates of the smaller values than of the total drive time.

How do we go from 90% confidence interval estimates of these component variables to the estimate of drive time? We run a Monte Carlo simulation to build the normal distribution of the drive time variable based on its component variables.

This is like imagining driving home 10,000 times. For each iteration, randomly pick a value for each of the component variables based on its normal distribution (mean and sigma) and add them up:

drive time (iteration n) = exit time (n) + getting to highway time (n) + …

Once you have these 10,000 drive times, find what percentage of the scenarios have a drive time less than a certain value. That is the probability we were looking for.
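A minimal sketch of such a simulation follows. The three components and their 90% intervals are hypothetical, invented only to show the mechanics, and each component is assumed to be normal and independent of the others.

```python
import random
from statistics import NormalDist

Z90 = NormalDist().inv_cdf(0.95)  # about 1.645; a 90% interval spans 2 * Z90 sd

# Hypothetical 90% intervals, in minutes, for each component of the drive.
COMPONENTS = {
    "exit parking lot": (2, 8),
    "get to highway": (5, 15),
    "highway to home": (10, 20),
}

def simulate_drive_times(components, runs=10_000, seed=0):
    """Imagine driving home `runs` times; each drive is the sum of its parts."""
    rng = random.Random(seed)
    params = [((lo + hi) / 2, (hi - lo) / (2 * Z90)) for lo, hi in components.values()]
    return [sum(rng.gauss(mean, sd) for mean, sd in params) for _ in range(runs)]

def share_below(times, limit):
    """Fraction of simulated drives that took at most `limit` minutes."""
    return sum(t <= limit for t in times) / len(times)

times = simulate_drive_times(COMPONENTS)
print(f"Chance the drive takes 32 minutes or less: {share_below(times, 32):.0%}")
```

Adding a roadwork delay is then one more entry in `COMPONENTS` and a rerun.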

From this we could say, “there is 63% chance mom will be home when we arrive”.

We could also say, “there is only 5% chance mom will arrive 30 minutes after we arrive”.

When we know there is roadwork starting on a segment we can add another delay component (based on its 90% confidence interval) and rerun the simulation.

That is the power of statistical modeling: we estimate unknowns based on our initial estimates and state our level of confidence in the final answer.

Now what does this have to do with Pinterest revenue?

## Fail fast because successful companies failed before they succeeded

There are several versions of this statement. One way or another they glorify failure, and in the name of exhorting startup founders these inspirational statements lead one to believe:

1. After a few failures success is inevitable
2. You must fail first to succeed
3. Fail fast so you can succeed
4. Failures signal impending success
5. “Failure can be a true blessing in that it educates you and prepares you for success” (from here)
6. “Remember that most successful entrepreneurs fail good and hard before they finally make it” (same source)

All these assertions are happy to point out popular examples. The problem is the assertions are derived from the very examples they are using as evidence.

First, let us make something very clear. Success and Failure are the only two possible outcomes for any venture you undertake. But the fact that there are just two outcomes does not mean they are equally probable. It is not a case of tossing a fair coin and calculating the odds of heads or tails. The chances of success and failure can be, and are, very different. If you take the base rate (looking at the success rate of thousands of ventures and small businesses), the success rate is 3 to 5%.

Second, even if we assume that Success and Failure are equally likely, a series of failures does not mean inevitable success. Take the coin example. The probability of getting 10 Tails in a row is the same as the probability of getting 9 Tails in a row followed by a Head.
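The coin claim is easy to verify. This is a small sketch that assumes a fair coin and independent tosses; `sequence_probability` is a made-up helper name.

```python
from fractions import Fraction

def sequence_probability(sequence, p_heads=Fraction(1, 2)):
    """Probability of one exact sequence of independent tosses, e.g. 'TTH'."""
    prob = Fraction(1)
    for toss in sequence:
        prob *= p_heads if toss == "H" else 1 - p_heads
    return prob

ten_tails = sequence_probability("T" * 10)
nine_tails_then_head = sequence_probability("T" * 9 + "H")
print(ten_tails, nine_tails_then_head)

# After nine Tails, the chance that the next toss is a Head is still one half.
print(nine_tails_then_head / sequence_probability("T" * 9))
```

Nine failures in a row do nothing to the odds of the tenth attempt.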

Lastly, the fact that those who succeeded had failed in the past is irrelevant. Those who make such an argument pick only the success stories that are popular, recent and available to them. When you only look at those who succeeded and are still in business, you are leaving out all those who never succeeded and gave up, or are still trying without success. Even in these cited success stories, success is mostly random rather than a result of their failures. The fact that those who succeeded had “failed hard” does not mean that when you fail you will succeed.

Granted, they learned from their mistakes, but you do not have to learn from your own mistakes. You do not have to fail to learn. Failure is not the true blessing. Insane success, with a valuation in the hundreds of billions even when your venture has no real product or clear value add, is the true blessing.

Those who advise you to fail are not being intellectually honest. Their advice is no different from advising a gambler to bet on a slot machine that has been coming up empty for the past few hours.

## Running a survey? Using a raffle to increase response rate?

Before you read on, take a moment to stop and think about the options. You are running a survey and have decided to use a raffle rather than a pay-per-response method to get your target customers to respond. The question then is which raffle, for the same amount, will get you a better response rate?

You do not have resources to run both. You are going to pick one method.

Let us do the math first to see which option offers better expected value to your respondents.

Assume 100 customers.

Option A: One winner takes the whole \$250. The chance of winning is 1/100, so the expected value of the \$250 prize pot is \$2.50.

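Option B: There are 10 chances to win (no duplicates), each with a \$25 prize.

It is tempting to add up the chances draw by draw:
For the first chance it is 1/100.
For the second it is 1/99.
…
For the 10th it is 1/91.

That sum gives an expected value of \$2.62, a tad more than \$2.50. But 1/99 is your chance in the second draw only if you did not win the first, which happens 99 times out of 100, so your overall chance of winning the second draw is 99/100 × 1/99 = 1/100. The same holds for every draw. Your chance of winning one of the 10 prizes is 10/100, and the expected value is \$25 × 10/100 = \$2.50, the same as Option A.

If your respondents were presented with these two options and asked to pick, they might choose Option B, since it offers the same expected value with ten times the chance of winning something.

But notice that your respondent does not get to see both options. They see either a 1/100 chance to win \$250 or a 1/10 chance to win \$25.

We are not good at math, and when handling probabilities we tend to focus on the magnitude of the prize pot over the chances of winning. You can see this behavior when the Powerball or another lottery pot rises past \$50 million.

So a 1/100 chance to win \$250 will look more attractive than the \$25 option, even though its expected value is no higher.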
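The Option B arithmetic is worth checking with a quick simulation. This is a sketch that assumes 100 respondents with one entry each and ten distinct winners of 25 dollars; all function names are made up for the illustration. Adding the per-draw chances 1/100 + 1/99 + … + 1/91 gives about 2.62, but each of those chances applies only to someone still in the pool at that draw. Simulating the raffle from one respondent's point of view gives a value close to 2.50.

```python
import random
from fractions import Fraction

def naive_option_b(n=100, prizes=10, prize=25):
    """Prize times the sum of per-draw conditional chances 1/n + 1/(n-1) + ..."""
    return prize * sum(Fraction(1, n - k) for k in range(prizes))

def exact_option_b(n=100, prizes=10, prize=25):
    """Each respondent ends up among the distinct winners with chance prizes/n."""
    return prize * Fraction(prizes, n)

def simulated_option_b(n=100, prizes=10, prize=25, runs=100_000, seed=0):
    """Run the raffle many times and average the payout to respondent 0."""
    rng = random.Random(seed)
    wins = sum(0 in rng.sample(range(n), prizes) for _ in range(runs))
    return prize * wins / runs

print(f"sum of per-draw chances: {float(naive_option_b()):.2f}")
print(f"exact expected value:    {float(exact_option_b()):.2f}")
print(f"simulated average:       {simulated_option_b():.2f}")
```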

You will most likely receive a better response rate by giving all the \$250 to one lucky respondent than by splitting it over 10 people.

You think otherwise? I am happy to run this test for you. I just need \$500 for the prize pot and \$500 for my fee.

## Results from the Quiz on Probability of Retweet

Remember the question I posed some time back on finding the probability of a tweet with a link being retweeted? The quiz was a fun way to make the audience realize for themselves the futility of any tips they see on improving retweets. 390 people took the quiz and answered it (40 because I asked, 350 because Avinash Kaushik asked).

Here are the results. Big thanks to SurveyGizmo for its amazing survey platform. The reports pretty much write themselves. You should not even try anything else for running your surveys. For all percentages you see the base is 390 responses.

For the first question I presented the only data that I saw in the report I quote. The answer distribution looks like this

One could say close to two thirds are likely to believe and accept whatever is implied by the commentary associated with a partial finding. While 36% asked for more data, only a third of them asked for the right data, the percentage of tweets retweeted, that would help them answer.

After they answered the first question I provided the additional data, the percentage of RTs. I provided as options the correct answer, the two wrong answers from the previous question, and two bogus answers. The answer distribution looks like this

About one in five found the answer (the answer is 16%). It is likely that, even in the presence of additional data, four out of five people can be convinced to accept a different answer, for example when a spurious conclusion is presented in the form of a fancy infographic or by someone popular. When you see 5000 people tweeted an infographic that talks about scientific ways to improve retweets, it is hard to stop and do the math.

The takeaway is that it is hard for folks to stop and take a critical look at reported social media findings. It is even harder to seek the right additional data and do the math. So most yield to mental shortcuts and answer the easy question.

Note that this 16% number is calculated only to show you what the average is. But an average hides segments, and there are likely multiple segments here. For some, link or not, every one of their tweets may be retweeted. The only takeaway is that this is a probability calculation and not a recipe, and as we collect more data the probability will change.