Pricing Strategy Vs. Pricing Parlor Tricks

A research paper published in Journal of Consumer Research, Jan 2012, found that how we present pricing affects perception

Presenting item quantity information before price (70 songs for $29) may  make the deal appear much more appealing than if the price were presented first ($29 for  70 songs).

There are many similar peer reviewed research reports that found behaviors like,

Customers are more likely to prefer prices ending with digit 9

Customers are immune to higher prices when you don’t show the $ sign

Customers pay higher prices when you write the price in words instead of numbers

Customers succumb to decoy pricing (present three options but one is asymmetrically dominated by other and hence a decoy)

Through books and TED talks these  academic reports seep into popular media and are presented as pricing lessons for businesses small and large, especially for startups. After all, these are peer reviewed research reports based on controlled experiments that found statistically significant difference, published in reputable journals and hence worthy of our trust?

May be these are true, but what do they tell us about the customers and their needs? What job is your customer hiring your product for when they pay this cleverly presented price?

The problem is these behavioral pricing tactics may just be statistical anomalies. Let me point you to a xkcd  comic that so nicely makes the point I am about to make . After what xkcd has to say, anything I say below is redundant.

Let us take the first research I quoted, “70 songs for $29 vs. $29 for 70 songs”. What could be wrong here?  Well, why specifically 70 and 29?  What other combinations did the researchers test and what are the outcomes? What about 60 for 25, 50 for 20 etc etc.

Is it possible that they had tested 20 different combinations and found that just this one produced statistically significant difference? (Like the green jelly beans in xkcd comic?). Did the researchers stash away all the experiments that produced no results and published  the one that produced this interesting result?

An opinion piece in Business Strategy Review, published by London School of Economics, pretty much says this is the case with most research we read.

The problem is that if you have collected a whole bunch of data and you don’t find anything or at least nothing really interesting and new, no journal is going to publish it.

Because journals will only publish novel, interesting findings – and therefore researchers only bother to write up seemingly intriguing counterintuitive findings – the chance that what they eventually are publishing is BS unwittingly is vast.

Pretty much we cannot trust any of the research we read.

What are likely statistical flukes get published as interesting findings on pricing and find their way into books, TED talks and blogs. The rest don’t even leave researcher’s desk. Let alone academic journal, try writing a blog post that reports, “found no statistically significant difference”. Who will read that?

What we are seeing is publication bias that is worse than any sampling bias or analysis bias and a prevalence of pricing parlor tricks presented as authoritative lessons in pricing for businesses.

When it comes to pricing your product, be it pricing cupcakes or a webapp, you would do well to look past these parlor tricks and start with the basics.

Pricing strategy starts with customer segments and their needs. You cannot serve all segments, you need to make choices. Choose the segments you can target and deliver them a product at a price they are willing to pay.

As boring and dull as it may sound, that is pricing strategy. Your business will do well to start with the most boring and dull than chasing the latest parlor trick based on selective reporting.

Everything else is distraction. May be these fine tunings have some effect but not before strategy. After you get your foundation right, then you can worry about what font to use in the sign board.

How do you set your pricing?

Other Readings:

  1. Segment-Version Fit
  2. Five Ways Startups Get Pricing Wrong
  3. Small Business Pricing
  4. Three Components of Effective Pricing
  5. Approximate Guide to Pricing Webapps  (buy access for 99 cents, pun intended )

Google Customer Surveys – True Business Model Innovation, But

Summary:Great business model innovation that points to the future of unbundled pricing. But is Google customer survey an effective marketing research tool? Do not cancel SurveyGizmo subscription yet.

Google’s new service, Customer Surveys, is truly a business model innovation. It unlocks value by creating a three sided market:

  1. Content creators who want to monetize their content in an unbundled fashion (charge per article, charge per access etc)
  2. Readers who want access to paid content without having to subscribe for entire content or muddle through micro-payments (pay per access)
  3. Brands seeking customer insights, willing to pay for it but have been unable to find a reliable or cheaper way to get this
When readers want to access premium content they can get it by answering a question posed by one of the brands instead of paying for access. Brands create surveys using Google customer surveys and pay per use input.

Google charges brands 10 cents per response, pays 5 cents to the content creators and keeps the rest for enabling this three sided market.

Business model is nothing but value creation and value capture. Business model innovation means innovation in value creation, capture or both. By adding a third side with its own value creation and capture Google has created an innovative three way exchange to orchestrate the business model.
This also addresses the problem with unbundled pricing, mostly operational challenges with micro-payments and metering.

But I cannot help but notice severe sloppiness in their product and messaging.

Sample Size recommendation: Google recommends brands to sign up for 1500 responses. Their reason, “recommended for statistical significance”.
Statistical significance has no meaning for surveys unless you are doing hypothesis testing. When brands are trying to find out which diaper bag feature is important, they are not doing hypothesis testing.

What they likely mean is Confidence Interval (or margin of error at a certain confidence level). What is the margin of error, at 95% confidence level? With 1500 samples, assuming 200 million as the population size it is 2.5%. But you do not need that precise value given you already have sampling bias by opting for Google Customer Surveys. Most would do well with just 5% margin of error which requires only 385 responses or 10% which requires only 97 responses.

Recommending 1500 responses is at best a deliberate pricing anchor, at worst an error.

If they really mean hypothesis testing, one can use a survey tool for that, but it is not coming through in the rest of their messaging which is all about response collection. The 1500 responses suggestion is still questionable. For most statistical hypothesis testing 385 samples are enough (Rethinking Data Analysis published in the International Journal of Marketing Research, Vol 52, Issue 1).

Survey of one question at a time: Brands can create surveys that have multiple questions in them but respondents will only see one question at any given time.
Google says,

With Google Consumer Surveys, you can run multi-question surveys by asking people one question at a time. This results in higher response rates (~40% compared with an industry standard of 0.1 – 2%) and more accurate answers.
It is not a fair comparison regarding response rate. Besides we cannot ignore the fact that the response may be just a mindless mouse click by the reader anxious to get to their article. For the same reason they cannot claim , “more accurate”.

Do not cancel your SurveyGizmo subscription yet. There is a reason why marketing researchers carefully craft a multiple question survey. They want to get responses on a per user basis, run factor analysis, segment the data using cluster analysis or run some regression analysis between survey variables.

Google says,

The system will automatically look for correlations between questions and pull out hypotheses.

I am willing to believe there is a way for them to “collate” (not correlate as they say) the responses to multiple questions of same survey by each user and present as one unified response set. If you can string together responses to multiple questions on a per user basis you can do all the statistical analysis I mentioned above.<;

But I do not get what they mean by, “look for correlations between questions” and definitely don’t get, “pull out hypotheses”. It is us, the decision makers,who make the hypothesis in the hypothesis testing. We are paid to make better hypotheses that are worthy of testing.

If we accept the phrase, “pull out hypotheses”, to be true then it really means we need yet another data collection process (from a completely different source) to test the hypotheses they pulled out for us. Because you cannot use the very data you used to form a hypothesis to test it as well.

Net-Net, an elegant business model innovation with severe execution errors.

If you cared to run the numbers – Looking beyond the beauty of Infographics

I debated whether or not to write this article. There is really no point in writing articles that point out flaws in some popular piece. Neither the authors of those posts nor the audience care. For those who care, they already understand the math and this  article adds no incremental value.

But the case in point is so egregious that it serves as a poster boy for the need for running the numbers, to test BIG claims for their veracity, and look beyond the glossy eye candies.

This one comes from VentureBeat and has a very catchy title that made 2125 people to Like it on Facebook. All of them likely just read the title and are satisfied with it or saw the colorful infographic and believed the claim without bothering to check for themselves. There is also the comfort in knowing that they are not alone in the Likes.

You can’t expect the general population to do some critical thinking or any analysis given the general lack of statistical skills and their cognitive laziness. It is the System-1 at work with a lazy System-2 (surely you bought Kahneman’s new book).

You would think the author of the article should have checked, but the poor fellow is likely a designer who can do eye-popping  infographics and cannot run tests for statistical significance. He is likely an expert in stating whether using rounded corners with certain shading is better #UX or not.

The catchy title and the subject also don’t help.

So almost everyone accept the claim for what it is.  But is there one bit of truth in VentureBeat’s claim?

Let us run the numbers here.

Without further ado, here is the title of the article that 2125 facebook people Liked.

Women who play online games have more sex (Infographic)

How did they arrive at the claim? They looked at data collected by Harris Interactive which surveyed over 2000 adults across US. Since the survey found 57% female gamers reported having sex vs. 52% female non-gamers, it makes the bold claim in its title. Here is a picture to support the claim.

The claim supported by the beautiful picture sounds plausible?

How would you verify whether the difference is not statistical noise?

You would run a simple crosstab (chi-square test)- and there are online tools that makes this step easier. What does this mean? You will test whether the difference between the number of female gamers reported having sex and female non-gamers reporting the same is statistically significant.

The first step is to work with absolute numbers not percentages. We need numbers that 57% and 52% correspond to. For this we need number of females surveyed and what percentage are gamers and non-gamers.

The VentureBeat infographic says, “over 2000 adults surveyed”. The exact number happens to be 2132.

Let us find the number of gamers among females. The article says, of the gamers – 55% are females and 45% are males. This is not same as 55% of females are gamers. Interestingly they never reveal to us what percentage of the surveyed people are gamers. So we resort to data from other sources. One such source (circa 2008) says, 42% of population play games online. We can assume that the number is now 50%.

So the number of gamers and non-gamers is 1066 each. Then we can say (using data from the infographic)

Number of female gamers = 55% of 1066 = 587
Number of female non-gamers = ?? (it is not 1066-587)

The survey does not say number of males vs. female, but we can assume it is split evenly. If you want to be exact you can use the ratio from  which states 50.9% female to 49.1% male). So there are likely 1089 females surveyed.

That makes number of female non-gamers = 1089 – 587 = 502

The next step is find number of women reported having sex (easy to do from their graph)

Number of female gamers reported having sex = 57% of 587 = 335 (not having sex = 587-335 = 252)

Number of female non-gamers reported having sex = 52% of 502   = 261 (not = 241)

Now you are ready to build the 2X2 contingency table

Then you run the chi-square test to see if the difference between the numbers is statistically significant.

H0 (null hypothesis): The difference is just random

H1 (alternative hypothesis): The difference is not just random, more female gamers do have sex than female non-gamers.

You use the online tool and it does the work for you.

What do we see from the results? The Chi-square calculated for p-value of 0.05 (95% confidence) is 2.82. For the difference to be statistically significant the value has to be at least 3.84 (degrees of freedom =1).

Since that is not the case here, we see no reason to reject the null hypothesis that the difference is just random.

You can repeat this for their next chart that shows have sex at least 1x per week and you will find no reason to reject the null hypothesis.

So the BIG claim made by VentureBeat’s article and its colorful infographic is just plain wrong.

If you followed this far you can see that it is not easy to seek the right data and run the analysis. Most importantly it is not easy to question such claims from a popular blog. So we tend to yield to the claim, accept it, Like it, tweet it, etc.

Now that you learned to question such claims and credibly analyze it, go apply what you learned to every big claim you read.

Significance of Random Flukes

One sure way to kill a joke is to explain it. I hate to kill this great and clever joke on statistical significance, but here it goes. May be you want to just read the joke, love it, treasure it and move on without reading rest of the article.

Love it? I love this one for its simple elegance. You can leave now if you do not want to see this dissected.

First the good things.

The “scientists” start with hypotheses external to the data source, collect data and test for statistical significance. They likely used 1-tailed t-test and ran a between groups experiment.

One group was control group that did not eat the jelly beans. Other group was the treatment group that was treated with jelly beans.

The null hypothesis H0 is, “ Any observed difference between the number of occurrences of acne between the two groups is just due to coincidence”.

The alternative hypothesis H1 is, “The differences are statistically significant. The jelly beans made a difference”.

They use p value of 0.05 (95% confidence). p-value of 0.05 means there is only 5% probability the observed result can be entirely due to chance. If the computed p-value is less than 0.05 (p<0.05), they reject H0 and accept H1. If the computed p-value is greater than 0.05  (p>0.05) H0 cannot be rejected, it is all random.

They run a total of 21 experiments.

The first is the aggregate. They likely used a large jar of mixed color jelly beans and ran the test and found no compelling evidence to overthrow the null hypothesis that it was just coincidence.

Then they run 20 more experiments, one for each color. They find that in 19 of the experiments (with 19 different colors) they cannot rule out coincidence. But in one experiment using green jelly beans they find p less than 0.05. They  reject H0 and accept H1 that green jelly beans made a difference in the number of occurrences of acne.

In 20 out of 21 experiments (95.23%), the results were not significant to toss out coincidence as the reason. In 1  out of 21 experiments (4.77% ) it was and hence green was linked to acne.

In other words, there was 95.23% probability (p=0.9523) that any observed link between jelly bean and acne is just random.

However the media reports, “Green jelly beans linked to acne at 95% confidence level”, because that experiment found p<0.05.

Green color is the spurious variable. The fact that the Green experiment had p<0.05 could easily be because this experiment run happened to have high concentration of random flukes in it.

The very definition of statistical significance testing using random sampling is just that.

If we have not seen the first experiment  or the 19 other experiments that had p>0.05, we would be tempted to accept the link between green jelly beans and acne. Since we saw all the negative results, we know better.

In reality, we don’t see most if not all of the negative findings. Only the positive results get written up – be it the results of an A/B test that magically increased conversion or scientific research.
After all it is not interesting to read how changing just one word did not have a effect on conversion rates. Such negative findings deserve their place in the round filing cabinet.

By rejecting all negative findings and choosing only postive findings, the experimenters violate the rule of random sampling and  highlight the high concentration of random flukes as breakthrough.

The next step in this slippery slope of pseudo statistical testing is Data Dredging. Here one skips the need for initial hypotheses and simply launches into data for finding “interesting correlations”.
Data Dredging is slicing up data in every possible dimension to find something – anything.

For example, “Eating Green jelly beans with left hand while standing up on Tuesdays” causes acne.

If you find this claim is so ridiculous that you will not fall for it, consider all the articles you have seen about the best days to run email marketing  and best days to tweet OR How to do marketing like  certain brands.

Can you spot the fact that these are based on Data Dredging?

(See here for great article on Data Dredging).

In this age of instant publication, easy experimentation, Big Data and social media echo chamber, how can you spot and stay clear of Random Flukes reported as scientific research?

You can start with this framework:

  1. What are the initial hypotheses before collecting data? If there is none, thanks but no thanks. (Data Dredging)
  2. How are these hypotheses arrived at? If these were derived from the very data they are tested with, keep moving. A great example of this is the class of books, “7 Habits of …”
  3. See a study from extremely large samples? Many of the random flukes that do not show up in small samples do get magnified in large samples. It is just due to the mathematical artifact. Again, thanks but no thanks.
  4. Very narrow finding? It is the Green jelly bean again ask about other dimensions.

Or you can just plain ignore all these nonsensical findings camouflaged in analytics.


The Hidden Hypotheses We Take For Granted

In A/B testing,  you control for many factors and test only one hypothesis – be it  two different calls to action or two different colors for BuyNow buttons. When you find statistically significant difference in conversion rates between the two groups, you declare one version is superior to other.

Hidden in this hypothesis testing are many implicit hypotheses that we treat as truth. If any one of them prove to be not true then our conclusion from the A/B testing will be wrong.

Dave Rekuc, who runs an eCommerce site, posed a question in Avinash Kaushik’s blog post on test for statistical significance and A/B testing. Dave’s question surfaces the very issue of one such hidden hypothesis

I work for an ecommerce site that has a price range of anywhere from $3 an item to $299 an item. So, I feel like in some situations only looking at conversion rate is looking at 1 piece of the puzzle.

I’ve often used sales/session or tried to factor in AOV when looking at conversion, but I’ve had a lot of trouble coming up with a statistical method to ensure my tests’ relevance. I can check to see if both conversion and AOV pass a null hypothesis test, but in the case that they both do, I’m back at square one.

Dave’s question is, whether the result from the conversion test experiment hold true across all price ranges.

He is correct in stating that looking at conversion rate alone is looking at one part of the puzzle.

When  items vary in price, like he said from $3 to $299, the test for statistical significance of difference between conversion rates assumes an implicit hypothesis that is treated as truth.

A1: The difference in conversion rates does not differ across price ranges.

and the null hypothesis (same, just added for completeness)

H0: Any difference between the conversion rates is due to randomness

When your data tells you that H0 can or cannot be rejected, it is conditioned on the implicit assumption A1 being true.

But what if A1 is false? In Dave’s case he uncovered one. What about many other such hypotheses? Other examples include, treating the target population as the same (no male/female difference, no Geo specific difference etc) and products as the same.

I point out to two different results from the same data set by segmenting and not segmenting the population  in one of my previous posts.

That is the peril of hidden hypotheses.

What is the solution for a situation like Dave’s?  Either you explicitly test this assumption first or as simpler option, segment your data and test each segment for statistical significance. Since you have a range of price points I recommend you test over 4-5 price ranges.

What is the solution for the bigger problem of many different hidden hypotheses?

Talk to me.

Use of Information Priors in A/B Testing

Last time I wrote about the use of prior knowledge in A/B testing there was considerable push back from the analytics community. I think I touched a nerve when I suggested the use of “how confident you were before the test” to interpret the results after the test.  While the use of  such information may sound like gut-feel and arbitrary, we must recognize that we implicitly use considerable information priors in A/B testing. The Bayesian methods I used just made the implicit assumptions explicit.

When you finally get down to test two (or three) versions with  A/B split testing, you have implicitly eliminated many other versions. You should stop and ask why you are not testing every possible combination. The answer is you applied tacit knowledge that you have, either based on your own prior testing or well established best practices and eliminated many versions that required no testing. That is the information prior!

Now let us take this one step further. Of the two versions you selected, make a call on how confident you are that one will perform better than the other. This can be based on prior knowledge about the design elements and user experience or an estimate that is biased. This should not surprise you, after all we all seem to be finding reasons why one performed better than the other after the fact.  In fact the latter scenario has hindsight bias whereas I am simply asking you to state your prior expectation of which version will perform better.

Note that I am not asking you to predict by how much, only how confident you are that there will be real (not statistically significant, but economically significant) difference between the two versions. You should write this down, before you start testing and not after (I prefer to call A/B testing as collecting data). As long as the information is obtained through methods other than this test in question, it is a valid prior. It may not be precise  but it is valid.

What we have is the application of information priors in A/B testing – valid and relevant.

Next up, I will be asking you get rid of the test for statistical significance and look at A/B testing as a mean to reduce uncertainty in decision making.