Big Data predicts people who are promoted often quit anyway – But …

I saw this study from HP that used analytics (okay Big Data analytics, whatever that means here) to predict employee attrition.

HP data scientists believed a companywide implementation of the system could deliver $300 million in potential savings “related to attrition replacement and productivity.”

I must say that, unlike most data dredging with its selective reporting, these data scientists started with a clear goal in mind and a decision to change before diving into data analysis. It is not the usual:

“Storage and compute are cheap. Why throw away any data? Why make a decision of what is important and what is not? Why settle for sampling when you can analyze them all? Let us throw in Hadoop and we will find something interesting.”

Their work found:

Those employees who had been promoted more times were more likely to quit, unless a more significant pay hike had gone along with the promotion

The problem? This is not a hypothesis they developed independent of the data and then collected data to test. That is the prescribed approach to hypothesis-driven data analysis. Even with that method, one cannot stop when the data fits the hypothesis, because data can fit any number of plausible hypotheses.

The problem is magnified with Big Data, where even tiny correlations get reported because of the sheer volume of data.
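This volume effect is easy to simulate. The sketch below uses pure noise only (no real HP data, all numbers illustrative): thousands of random attributes correlated against a random outcome. With enough columns, a correlation that looks reportable always shows up.

```python
import numpy as np

rng = np.random.default_rng(42)

# 1,000 "employees", 5,000 unrelated random attributes, and a random
# outcome -- by construction, no variable has any real relationship.
n_employees, n_attributes = 1000, 5000
attributes = rng.normal(size=(n_employees, n_attributes))
outcome = rng.normal(size=n_employees)

# Correlate every attribute with the outcome and keep the strongest.
correlations = np.array([
    np.corrcoef(attributes[:, j], outcome)[0, 1]
    for j in range(n_attributes)
])
strongest = np.abs(correlations).max()

# With 5,000 chances, pure noise reliably produces a "finding".
print(f"strongest correlation found in pure noise: {strongest:.3f}")
```

The more columns you mine, the larger the strongest spurious correlation gets, which is exactly why volume alone makes "we found something" claims suspect.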

What does it mean that people who are promoted often quit?

Is it the frequent promotion that is the culprit? Isn’t it more likely that those who are driven and add high value are more likely to get promoted often, more likely to want to take on new challenges, and more attractive to other companies?

The study adds, “unless associated with a more significant pay hike”.

Isn’t it more likely that the company is either simply using titles to keep disgruntled employees happy or making up titles to hold on to high-performing employees without really paying them for the value they add? In either case, aren’t the employees more likely to leave after a few namesake promotions that really don’t mean anything?

Let us look at the flip side. Why do people who are not promoted frequently end up staying? Why do companies give big raises to keep those who were promoted?

Will stopping frequent promotion stop the attrition? Or will frequent promotion with big pay raises stop it? Neither will have an effect.

The study and the analysis fail to ask:

  • Is the business better off paying big raises to keep those who are frequently promoted than letting them leave?
  • Is the business better off if those who are not promoted often choose to stay?

That is the problem with this study and with Big Data analytics that do not start with a hypothesis developed outside of the very data used to prove it. They find the first interesting correlation, “frequent promotions associated with attrition,” and declare predictability without getting into alternate explanations and root causes.

Big Data does not mean suspension of mind and eradication of theory. The right flow remains:

  1. What is the decision you are trying to change?
  2. What are the hypotheses about the drivers, developed by application of mind and prior knowledge?
  3. What is the right data to test them?
  4. When the data fits a hypothesis, could there be alternate hypotheses that explain the data just as well?
  5. How do the hypotheses change when new data comes in?
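Step 5 can be made concrete with Bayes rule. The sketch below uses entirely made-up priors and likelihoods (illustrative assumptions, not HP's numbers) to show how a belief in a hypothesis should be revised, never "proved", as evidence arrives:

```python
def update(prior, likelihood_if_true, likelihood_if_false):
    """Bayes rule: posterior probability that the hypothesis is true."""
    numerator = prior * likelihood_if_true
    return numerator / (numerator + (1 - prior) * likelihood_if_false)

# Step 2: a hypothesis with an honest prior, not a certainty.
# Hypothetical prior that frequent promotion drives attrition.
belief = 0.3

# Steps 3-5: each new piece of data shifts the belief up or down
# depending on how much better the hypothesis explains it than
# the alternative. Likelihood pairs below are illustrative.
for p_if_true, p_if_false in [(0.7, 0.4), (0.6, 0.5), (0.8, 0.3)]:
    belief = update(belief, p_if_true, p_if_false)
    print(f"updated belief: {belief:.2f}")
```

Note that the belief never reaches 1.0: even data that fits well leaves room for the alternate hypotheses of step 4.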

How do you put Big Data to work?

Yet Another Causation Confusion – School Ratings and Home Foreclosures

Here is the headline from the WSJ:

One Antidote to Foreclosures: Good Schools

The article is based on data mining by Location Inc., which found that

over the past six months the percentage of foreclosure (or “real-estate-owned”) sales went down as school rankings went up in five metro areas

Next we see a news story attributing clear causation. This one is hard to notice for some because it appears to be longitudinal – reported over a six-month period. If it were a simpler cross-sectional study, most would catch the fallacy right away.

First, let me point out that this is the problem with data mining – digging for nuggets in mountains of Big Data without an initial hypothesis and finding such causations.

Second, the school-rating improvement could be due to random factors that coincide with lower foreclosures.

Third, even though the longitudinal aspect suggests causation, there are many omitted variables here – common factors that drive down foreclosures and drive up school ratings.

School rating is not an independent metric. It relies not just on teacher performance but also on parents. The same people who are willing to work with their kids are also likely to be fiscally responsible. Another controversial but proven factor is the effect of genes on children’s performance.
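This omitted-variable story can be simulated. In the sketch below, a hypothetical confounder (neighborhood-level parental engagement, with made-up coefficients) drives school ratings up and foreclosure rates down, while the two outcomes never influence each other directly – yet they come out strongly correlated:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical confounder: parental engagement / fiscal responsibility
# by neighborhood. It pushes school ratings UP and foreclosures DOWN;
# neither outcome touches the other directly.
n_neighborhoods = 500
engagement = rng.normal(size=n_neighborhoods)

school_rating = 0.8 * engagement + rng.normal(scale=0.5, size=n_neighborhoods)
foreclosure_rate = -0.8 * engagement + rng.normal(scale=0.5, size=n_neighborhoods)

r = np.corrcoef(school_rating, foreclosure_rate)[0, 1]
# Strongly negative -- exactly the headline pattern -- with zero
# direct causation between the two variables.
print(f"correlation between rating and foreclosures: {r:.2f}")
```

Intervening on the rating itself (say, relabeling schools) would leave the foreclosure process untouched, which is the point of the wolves-and-eclipses line below.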

Ignoring all this, if we focus our resources on improving school ratings to solve the foreclosure crisis, we will be chasing away the wolves that cause eclipses with loud noises.

Let us hunt for something interesting in this data gold mine

How many times have you heard this?

We are collecting a lot of data on our customers/transactions/sales/logs; let us look at this goldmine to see if we can find anything interesting.

The problem with seeking something interesting is that you are bound to find it. You might call that tautological, but the fact is, if it is interesting, you are bound to find it. To a determined data miner, any interesting statistical outlier will eventually show up; then it is simply a matter of writing the hypothesis around it.

Data mining, or as some might call it, data trolling, is looking for patterns in data sitting around, as opposed to deliberate decision making, which requires seeking specific information to reduce uncertainty. But data can fit any number of hypotheses, and mining for a cause, we are bound to pick

  • Ones that are most convenient, like the man searching for his lost key under the light.
  • Ones that are familiar, based on our past experience and our beliefs – well there are many fables about this.

The way to make informed decisions is to frame a hypothesis based on the best prior knowledge we have. Know that this is just a hypothesis, not a fact, and that it has uncertainties associated with it. Then collect specific data to refine it and reduce the uncertainty.

We will never know all the facts with certainty, but if we realize that what we know has uncertainty associated with it and there could be far more that we do not yet know, we are on the right track.

How do you make your decisions?