Understanding (and respecting) the limits of your data

The business of gathering data, of performing experiments—this is the bread and butter of being a scientist. The task of doing controls, of analyzing data—these are the essential jobs that just must be done if we want to make models and to understand the universe. Our ability to dissect the truth relies entirely upon the quality of the data we generate.

One of the ideas I was getting at in my last post is that we need to be rigorous when we gather, analyze and interpret the data. In response, one of my friends made an interesting point:

“To conclude MBAs help or destroy the chance of becoming a unicorn would require both looking at the background distribution and controlling for all the other variables that may help or prevent success. Given the latter is hard to do, looking at the background distribution would still leave [it] inconclusive.”

This is a sophisticated thought, which goes beyond the “mere” idea of good data analysis to consider the intrinsic limitations of the data itself. The fact of the matter is that, sometimes (and much more often than I would like!), the data just aren’t powerful enough to answer the question that I’ve asked. In those cases, I simply lack the data to refine my hypothesis or to rule out some of the competing hypotheses.

To give this concept some grounding, let’s play a fun word game. It’s quite simple—just fill in the missing letter to create a true statement about my mom and me: “I am bl_nder than my mom.”

At the outset, there are 26 possibilities—but we can certainly cull some of these. The first thing to do is consider whether the missing letter is a consonant or a vowel. I’m sure you suspected immediately that no consonant would yield a word—a fact you can easily verify with a quick search through the dictionary. How about a ‘Y’? As you might have guessed, ‘blynder’ is not a word either.

This analysis leaves us with the five vowels. I have, at times, been an avid Scrabble player, and so I’m always on the lookout for interesting letter combinations that can form 7-letter words. This particular example is one of my favorites, because adding any vowel creates a word: ‘blander’; ‘blender’; ‘blinder’; ‘blonder’; ‘blunder’.

Can you further refine these five choices? You can’t use the criterion of it being a word any more; instead, you have to use the rest of the sentence and a slight knowledge of grammar. The next important observation is the use of ‘than’, which highlights that the missing word is an adjective (a comparative one, at that). Are any of our possible words not adjectives? Luckily, yes: ‘blender’ is a noun, and ‘blunder’ is either a noun or a verb. (‘Blinder’ can be either a noun or an adjective, and so it remains on our list.) And so we have three remaining possibilities: “I am blander than my mom”; “I am blinder than my mom”; “I am blonder than my mom.”

Can you do better than this with the available data? No, not unless you get some more information about my mom and me. But, even then, it will always be very hard to rule out the possibility that I am, in fact, much blander than my mom. So, really, the best you can do is to say that our missing letter is an ‘A’, ‘I’ or ‘O’, and leave it at that.
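For anyone who likes to see the elimination spelled out, here is a minimal sketch of the word game in Python. The tiny `LEXICON` below is a hand-labeled stand-in for a real dictionary (an assumption made for illustration, as are the names `fill` and `LEXICON`); it simply records the parts of speech discussed above.

```python
import string

# Hypothetical mini-lexicon standing in for a real dictionary. Each known
# word maps to the parts of speech it can take (hand-labeled for this sketch).
LEXICON = {
    "blander": {"adjective"},
    "blender": {"noun"},
    "blinder": {"noun", "adjective"},
    "blonder": {"adjective"},
    "blunder": {"noun", "verb"},
}


def fill(letter):
    """Place a candidate letter into the blank of 'bl_nder'."""
    return f"bl{letter}nder"


# Step 1: start with all 26 letters.
candidates = list(string.ascii_lowercase)

# Step 2: keep only the letters that produce a word in the lexicon.
words = [c for c in candidates if fill(c) in LEXICON]

# Step 3: 'than' demands a comparative adjective, so keep only the
# letters whose word can act as an adjective.
adjectives = [c for c in words if "adjective" in LEXICON[fill(c)]]

print(len(candidates))  # 26
print(words)            # ['a', 'e', 'i', 'o', 'u']
print(adjectives)       # ['a', 'i', 'o']
```

Each filter uses up one source of information (the dictionary, then the grammar). Once both are spent, three candidates remain, and nothing in the sentence itself can shrink the list further.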

In the case of MBAs and Unicorns, it’s likely that some of the data we need (such as the distributions of MBAs in Unicorn and non-Unicorn companies) are hard to find. But, worse than that, a lot of the data we need to be truly confident in our conclusions (e.g., knowledge of how other characteristics vary between Unicorn and non-Unicorn companies) may be even harder to come by.
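To make the role of the background distribution concrete, here is a toy calculation in Python. Every count in it is made up purely for illustration (none of these numbers come from real companies), and the function name `enrichment` is my own invention.

```python
from fractions import Fraction


def enrichment(unicorn_mba, unicorn_total, other_mba, other_total):
    """Ratio of the MBA share among unicorns to the MBA share among non-unicorns."""
    unicorn_share = Fraction(unicorn_mba, unicorn_total)
    other_share = Fraction(other_mba, other_total)
    return unicorn_share / other_share


# Made-up scenario A: 30 of 100 unicorns have an MBA founder,
# and so do 300 of 1000 non-unicorns.
same_background = enrichment(30, 100, 300, 1000)

# Made-up scenario B: same unicorns, but only 100 of 1000 non-unicorns
# have an MBA founder.
rarer_background = enrichment(30, 100, 100, 1000)

print(same_background)   # 1
print(rarer_background)  # 3
```

The unicorn numbers are identical in both scenarios, yet the ratio is 1 in the first and 3 in the second. The unicorn data alone cannot tell the two apart. And even a ratio of 3 would not settle the question, because it says nothing about the other characteristics that differ between the two groups.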

And that’s the point. We should always respect the data, and never forget what the numbers can, and cannot, tell us.

How to Be a Scientist

(Or at least think like one)
