Correlation versus causation? I prefer correlation and causation

One of my mom’s favorite sayings about the perils of using only correlations to infer causation is “breakfast doesn’t cause lunch.” And I have to agree with her: this statement is a perfect illustration of the real danger in taking a correlation to mean causation.[1]

In an article I read recently about (the hazards of) big data, the authors made a key point:[2]

[Google’s engineers] cared about correlation rather than causation. This is common in big data analysis. Figuring out what causes what is hard (impossible, some say). Figuring out what is correlated with what is much cheaper and easier. That is why, according to Viktor Mayer-Schönberger and Kenneth Cukier’s book, Big Data, ‘causality won’t be discarded, but it is being knocked off its pedestal as the primary fountain of meaning.’

I had two responses when I read this.

First: What?! There really are people out there who think that causation has been ‘knocked off its pedestal’?!

Second: I started thinking more deeply about correlation and causation, especially because it seems like a lot of the world can be divided into two camps: on one side, people who think correlations are the be-all and end-all; on the other side, people who, knowing the breakfast-causes-lunch fallacy, insist that correlations say almost nothing at all.

Looking at that second point, what’s the right answer? Without a doubt, causation should remain on its pedestal. But correlations, when used appropriately, have their place too. In short, if you want to be a scientist, you should be thinking about correlation AND causation.

You see, correlations are useful for the exact same reason they are so treacherous: they suggest potential causations and help form hypotheses that can (and should) be explored more rigorously.

It’s like you’re on a first date with the data, and correlations are a way of starting the conversation—a sort of mathematical handshake, “hello” and “how do you do” all rolled into one.

To think about this a little more deeply, let’s imagine you’re an aficionado of both horror movies and cookie dough ice cream. Over the years, you’ve noticed that you have had some pretty tasty bowls of cookie dough ice cream while enjoying some amazing horror films. One day it hits you: Does indulging in cookie dough ice cream make horror films more enjoyable? Or does a fantastic horror film cause the cookie dough ice cream to taste better? Or, maybe, you just love them both, whether together or not.

Would correlations help us answer this conundrum?

To start, we can think of eight different causal relationships. (In interpreting this figure, arrows denote “causes”, and the T-bars denote “inhibits.”)

Now, let’s say that the data show the two to be correlated, something like this:

So, of those eight original models, we can exclude D–G. But, and here’s the important part, can we exclude Model H? No. Can we distinguish between Models A, B, C, and H? Of course not. Now we might begin to strongly favor Models A–C, because they give us a “nice” narrative, but we cannot say with any certainty that these two characteristics are any more than merely correlated.

In contrast, let’s say that our data show no correlation at all, like this:

How do we feel about those models now? Well, given the lack of correlation between cookie dough indulgence and horror film enjoyment, it now seems extremely unlikely that these two are causally related. Certainly, if they are, it’s not going to be anything so straightforward as Models A–G. And the data strongly suggest that these two characteristics are wholly unrelated: in other words, Model H.
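If you like to see this sort of thing in numbers, here is a minimal simulation sketch of the two scenarios above. Everything in it is an assumption made for illustration: the function names, the noise levels, and especially the hidden "mood" factor, which is a hypothetical stand-in for anything that could drive both enjoyments at once and is not one of the models in the figure. The noise is tuned so that the first two processes have the same underlying correlation (0.5), which is the whole point: the correlation alone cannot tell you which process produced the data.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def direct_cause(n, rng):
    """Ice cream enjoyment directly raises film enjoyment (assumed model)."""
    ice_cream = [rng.gauss(0, 1) for _ in range(n)]
    film = [x + rng.gauss(0, sqrt(3)) for x in ice_cream]
    return ice_cream, film


def common_cause(n, rng):
    """A hidden factor (hypothetical 'mood') drives both; neither causes the other."""
    mood = [rng.gauss(0, 1) for _ in range(n)]
    ice_cream = [m + rng.gauss(0, 1) for m in mood]
    film = [m + rng.gauss(0, 1) for m in mood]
    return ice_cream, film


def independent(n, rng):
    """The two enjoyments have nothing to do with each other."""
    ice_cream = [rng.gauss(0, 1) for _ in range(n)]
    film = [rng.gauss(0, 1) for _ in range(n)]
    return ice_cream, film


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so repeated runs match
    scenarios = [
        ("direct cause", direct_cause),
        ("common cause", common_cause),
        ("independent", independent),
    ]
    for name, simulate in scenarios:
        ice_cream, film = simulate(20_000, rng)
        print(f"{name}: r = {pearson(ice_cream, film):.2f}")
```

With a large sample, the first two scenarios should both land near r = 0.5 and the third near zero. So a correlation near zero lets you set aside the simple causal stories, but a correlation of 0.5 is equally at home with a direct effect and with a shared hidden driver.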

So to summarize, correlations can:

Suggest possible causal relationships—i.e. help you form (likely) hypotheses;
Help exclude causal relationships.

But—importantly—correlations cannot:

Demonstrate causal relationships;
Distinguish between various causal relationships.

So, yes, correlations are a really good way to greet your data. But, as in any good date, the “hello” is just a beginning, and, at some point, you have to move beyond small talk.

Because, remember, breakfast doesn’t cause lunch.

[1] She cites the mystery novelist Robert Parker as the origin for this quote. But I prefer to cite her, especially with Mother’s Day being next week.

[2] Incidentally, this article is really quite excellent and well worth a read.

2 thoughts on “Correlation versus causation? I prefer correlation and causation”

Richard Fergie

May 1, 2014 at 3:13 pm

http://jliszka.github.io/2013/12/18/bayesian-networks-and-causality.html goes over some of these ideas in more depth. But for me at least, it is much less easy to read. Are you building up a series of posts on this kind of thing?

- Olivia S. Rissland
  
  May 1, 2014 at 3:16 pm
  
  Oh–that’s a really great link! Thanks!

How to Be a Scientist

(Or at least think like one)
