Correlation: the Fundamental Pitfall of Big Data
One of the biggest shifts in healthcare is that huge amounts of data are now being collected from electronic medical records and monitoring devices. Much of the excitement around this comes from big data's tremendous potential to lead to new discoveries that otherwise wouldn't be possible.
For example, in 2011, a paper published in Clinical Pharmacology & Therapeutics revealed a previously unknown drug-drug interaction between paroxetine (commonly prescribed to treat depression) and pravastatin (commonly prescribed to treat high cholesterol). Data mining enabled the researchers to discover that these medications cause elevated blood glucose levels when given together, even though neither medication has this effect on its own. This finding demonstrates how using big data as a research tool can help improve clinical decision-making. Ultimately, the hope is that finding new patterns like this will lead to improved health outcomes.
That being said, using big data to inform clinical medicine comes with a pitfall the size of the Grand Canyon. Data mining — the process of finding patterns in large data sets — is fundamentally based on correlation, not causation. As a result, this method is incredibly prone to false discovery. Although some of the correlations uncovered by data mining will turn out to be meaningful, the vast majority of them won't be. Such false findings are already a problem in scientific research, and the proliferation of data mining is bound to make the problem dramatically worse.
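To see why mining for correlations is so prone to false discovery, consider a toy simulation (a sketch in plain Python; the variable names and thresholds are illustrative, not from any real study). It generates one hundred completely unrelated random "trends," each ten data points long, and then searches every pair for a strong correlation. Even though nothing is genuinely related to anything else, the search reliably turns up spuriously "strong" pairs, simply because so many comparisons are being made.

```python
import random
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

random.seed(0)  # fixed seed so the demonstration is reproducible

# 100 completely unrelated "trends", each 10 points long (roughly the
# shape of a decade of yearly data, as in many spurious correlations).
n_series, n_points = 100, 10
series = [[random.gauss(0, 1) for _ in range(n_points)]
          for _ in range(n_series)]

# Mine every pair of series for a "strong" correlation (|r| > 0.8).
strong_pairs = [
    (i, j)
    for i in range(n_series)
    for j in range(i + 1, n_series)
    if abs(pearson(series[i], series[j])) > 0.8
]

print(f"pairs examined: {n_series * (n_series - 1) // 2}")
print(f"spuriously 'strong' pairs found: {len(strong_pairs)}")
```

With 100 series there are 4,950 pairs to examine, so even an unlikely coincidence gets thousands of chances to occur. This is exactly the situation a data miner faces, except that real datasets contain far more than a hundred variables.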
Tyler Vigen helped demonstrate this by publishing an entire book full of spurious correlations. For example, he found that between 2000 and 2009, per capita cheese consumption was correlated with the number of people who died by becoming tangled in their bedsheets. Of course, common sense tells us there’s no real relationship between these two variables, but many correlations aren’t so blatantly spurious.
In 2009, a study published in Archives of Internal Medicine found that meat consumption was associated with increased mortality. It sounds plausible, especially considering that the researchers controlled for potential confounding factors like age, education, exercise habits, and smoking. In light of this finding, perhaps a public health campaign should be launched to encourage people to be vegetarian. Or maybe meat should be taxed like alcohol, soda, and cigarettes.
It can be a pretty short leap from discovering a new, interesting correlation to advocating for a new, unproven intervention. However, a deeper dive into this study reveals that in men, a higher consumption of red meat was also associated with an increased risk of accidental death, such as being killed in a car accident. Since eating red meat clearly doesn't cause car crashes, that association is almost certainly coincidental, which suggests the paper's main conclusion (that meat consumption is associated with increased mortality) should also be taken with a grain of salt.
Given enough data, it isn't difficult to find intriguing correlations, but correlations are often misleading. Even though we know that correlation does not imply causation, simply being aware of this isn't enough to protect against the innate human tendency to jump to plausible, albeit inaccurate, conclusions. Cognitive biases are a powerful force, and we can easily find ourselves walking down the wrong path. However, we have an opportunity to implement a robust system of checks and balances to help ensure that false findings don't become persistent. Putting precautions in place ahead of time is essential for safeguarding scientific progress.
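One standard statistical precaution along these lines is to correct for the sheer number of comparisons being made. The sketch below implements the Benjamini-Hochberg procedure, a widely used method for controlling the false discovery rate; the example p-values are made up for illustration. It is not a cure-all (it cannot turn correlation into causation), but it does keep a flood of chance findings from being reported as discoveries.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return the indices of hypotheses that can be rejected while
    controlling the false discovery rate at `fdr`
    (Benjamini-Hochberg procedure)."""
    m = len(p_values)
    # Sort p-values in ascending order, remembering original positions.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k / m) * fdr.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff = rank
    # Everything at or below that rank is declared a discovery.
    return sorted(order[:cutoff])

# Ten p-values from hypothetical mined correlations: only the first
# two survive the correction, even though five are below 0.05.
pvals = [0.001, 0.004, 0.019, 0.03, 0.04, 0.2, 0.35, 0.6, 0.8, 0.95]
print(benjamini_hochberg(pvals))  # prints [0, 1]
```

Note that a naive threshold of p < 0.05 would have declared five of these ten results significant; the correction, which scales each threshold by the number of tests, keeps only the two strongest.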