Understanding Simpson's Paradox


We faced a conundrum following a recent mailing campaign for one of our clients, which tested both a new creative and new lists. The agency wanted to test new creatives as well as a couple of new lists, while the client insisted that the bulk of the mailing go to their banker list using the tried and tested control. Neither asked us, the data scientists, for advice on the test matrix.

The results

On seeing the overall results the client concluded their control was the winner. The agency, on the other hand, concluded the opposite. See below:

Simpson's Paradox Table 1

To the client, the overall figures made it clear that their control was the winner. The agency, who had split the results for each creative by list, said the new creative was clearly the winner, as illustrated by the individual list results.

Simpson's Paradox Table 2

So what's happening?

Do the maths and you will see they are both right. At list level, the test creative is the winner on every list; looking at the overall results, the control is the winner.

What we are seeing is Simpson’s Paradox, and as a data agency we often see it when a scientific approach to sampling is not taken. The effect is caused by weighted averages: each creative's overall response rate is an average of its list-level rates, weighted by the volume mailed to each list. The list is, in this case, a “lurking” or confounding variable. Mail volume was unevenly distributed across the lists for the two creatives, and the lists themselves respond at different rates; together these cause the bizarre results. In these situations it’s like comparing chalk and cheese. Only when we are sure the control and test groups are characteristically very similar, apart from the creative they receive, will we not be deceived by this paradox. Unfortunately, this is not very often the case.
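
The arithmetic can be checked with a short Python sketch. The figures below are hypothetical, chosen only to reproduce the pattern described here (most of the control volume on a high-response banker list, most of the test volume on lower-response new lists). They are not the campaign's actual results, and the names "New list A" and "New list B" are placeholders.

```python
# Hypothetical figures for illustration only, not the real campaign data.
# (list, creative) -> (pieces mailed, responses)
results = {
    ("Banker", "control"): (80_000, 2_400),     # 3.0% response
    ("Banker", "test"): (5_000, 175),           # 3.5% response
    ("New list A", "control"): (5_000, 50),     # 1.0% response
    ("New list A", "test"): (20_000, 240),      # 1.2% response
    ("New list B", "control"): (5_000, 40),     # 0.8% response
    ("New list B", "test"): (20_000, 200),      # 1.0% response
}


def rate(mailed, responses):
    """Response rate as a fraction of pieces mailed."""
    return responses / mailed


def list_level_winners(results):
    """For each list, name the creative with the higher response rate."""
    winners = {}
    for lst in sorted({lst for lst, _ in results}):
        control = rate(*results[(lst, "control")])
        test = rate(*results[(lst, "test")])
        winners[lst] = "test" if test > control else "control"
    return winners


def overall_rates(results):
    """Pool volumes and responses across lists, then compute one rate per creative."""
    totals = {}
    for (_, creative), (mailed, responses) in results.items():
        m, r = totals.get(creative, (0, 0))
        totals[creative] = (m + mailed, r + responses)
    return {creative: rate(m, r) for creative, (m, r) in totals.items()}


winners = list_level_winners(results)
overall = overall_rates(results)

print("Winner per list:", winners)
print("Overall rates:", {k: f"{v:.2%}" for k, v in overall.items()})
```

With these made-up numbers the test creative wins on all three lists, yet the pooled control rate (about 2.8%) is roughly double the pooled test rate (about 1.4%), simply because 80,000 of the 90,000 control pieces went to the best-responding list.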

How to avoid the paradox

Take a scientific approach to test design, whether off-line as in the above example (a very costly mistake) or in on-line PPC digital testing. As marketing professionals begin to unleash the power of large-sample online testing, it becomes paramount that this trap is avoided.

Make sure that the test and control samples are spread evenly across lists. Better yet, employ stratified sampling techniques, which recognise the different sub-populations within the data and collect a random sample from each.
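
One way to apply this is stratified random assignment: within each list, the same fraction of names is randomly allocated to the test creative, so that list mix cannot differ between the two groups. The following is a minimal Python sketch under stated assumptions, not a production procedure. The function name `stratified_assign`, the record layout `(list_name, customer_id)`, the list sizes and the 20% test fraction are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_assign(records, stratum_of, test_fraction=0.5, seed=0):
    """Randomly assign records to 'test' or 'control' within each stratum.

    records       -- hashable items (here: (list_name, customer_id) tuples)
    stratum_of    -- function returning the stratum (here: the list) of a record
    test_fraction -- share of every stratum that receives the test creative
    seed          -- fixed seed so the assignment is reproducible
    """
    rng = random.Random(seed)

    # Group the records by stratum.
    by_stratum = defaultdict(list)
    for rec in records:
        by_stratum[stratum_of(rec)].append(rec)

    # Shuffle each stratum separately and give the first share to 'test'.
    assignment = {}
    for members in by_stratum.values():
        shuffled = members[:]
        rng.shuffle(shuffled)
        n_test = round(len(shuffled) * test_fraction)
        for i, rec in enumerate(shuffled):
            assignment[rec] = "test" if i < n_test else "control"
    return assignment


# Hypothetical mailing file: three lists of very different sizes.
records = (
    [("Banker", i) for i in range(1000)]
    + [("New list A", i) for i in range(200)]
    + [("New list B", i) for i in range(100)]
)

assignment = stratified_assign(
    records, stratum_of=lambda rec: rec[0], test_fraction=0.2, seed=42
)

# Count how many names on each list were given the test creative.
test_counts = defaultdict(int)
for (list_name, _), arm in assignment.items():
    if arm == "test":
        test_counts[list_name] += 1
print(dict(test_counts))
```

Because every list contributes the same proportion to each group, the overall comparison weights the lists identically for control and test, and the pooled result can no longer contradict the list-level results purely through volume mix.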

Assessing only the overall results here is a mistake. When a dataset represents a large population covering characteristically diverse sub-populations, ignoring this dynamic is at best sloppy. So as not to be undone by this pernicious paradox, Marketing Metrix adopts a comprehensive approach to the collection of relevant data and rigorously analyses all relevant variables.

The main lesson this teaches us is the importance of a granular approach to data analysis and sensitivity to the dangers of confounding variables.