The Arrogance of "Noise"

February 02, 2019

statistics, metrics, failure to communicate

This is a post about communication.

One of the through-lines of my academic and professional career is conflict between entrenched subject matter experts (SME) and hot-shot quantitative analysts. As a young undergraduate, I followed Baseball Prospectus Fangraphs through the SABRmetric revolution. I watched Nate Silver bring data-driven prognostication to the world of political journalism which had previously (and arguably still is) dominated by punditry. In my current job, I work with experienced analysts who have often been working on the same systems for years. In all of these cases, you have an existing set of methods and individuals experienced with those methods being confronted by a new set of analysts who want to use “advanced statistics” and think they have a better way to do things.

There’s probably a post worth doing just talking about these histories and how fields absorb new ideas and approaches (analytical paradigms?), but here I want to focus on a particular verbal tick many data scientists fall victim to; the above cases provide instructive examples.

Defense-Independent Pitching

Traditional measures of baseball pitcher performance like Wins and ERA are pretty poor metrics. Shortcomings of ERA include the key issue that earned runs are not exclusively attributable to pitcher performance. The defenders in the field also have a role to play in whether batted balls are hits and whether runners are able to score. Thus, the create of so-called “Defense-Independent Pitching Statistics” or DIPS. The first thought was to create a pitcher performance metric based exclusively on things that had nothing to do with anyone but the pitcher. This means focusing on strikeouts, walks and home runs. Everything that ends up in play? Well, that’s beyond the pitcher’s control. For example, FIP.

When this stuff came out, it was controversial. The arm-chair analysts were saying that pitchers had no control over what happened to batted balls, which baseball scouts, who had decades of experience watching baseball and evaluating talent, thought was absurd. “How can you not attribute to a pitcher the responsibility for surrendering a stinging line drive that hits the gap and goes for a double?” the scouts would ask. “It’s just luck!”, replied the analysts, safe in their knowledge that their metrics were far more stable year-over-year than ERA.

After years of looking at things (and producing richer data, including the ability to track the velocity and angle of every batted ball), the current consensus is a bit different. Pitchers have some control over things like many fly balls they give up. So as metrics evolved (e.g., xFIP), they began to adjust for things like fly ball rate.

So variation in ball in play outcome for pitchers isn’t entirely luck, and it turns out all those grizzled scouts were right?

Well, sorta….

The Myth of the Hot Hand

The controversy over whether the “hot hand” exists for basketball players pre-dates the SABR struggles in MLB, but it took a bit longer to resolve. The history here is well-covered, so I won’t go into much detail. Suffice it to say that Tversky and friends showed in 1985 that they couldn’t find evidence that players were streaky, based on some limited data and specific hypothesis tests. (Nintendo famously ignored these conclusions when designing NBA Jam.) This result was used as an example of how humans recognize fake patterns in random noise and as an anecdote for nerds to use when trying to relate to more athletic types. (It typically didn’t work.)

Athletes consistently differed, stating sincere belief that they were, indeed, more likely to make a shot when they were “hot”. And subsequently, the athletes were validated when better data gathering allowed researchers in 2014 to publish a new study, finding “a small, positive and significant Hot Hand effect.” (Bocskocsky, et al, 2014)

So once again, the arm-chair analysts were shown to be fools for claiming that players never got “hot” and people’s belief in the phenomena was because humans tend to fabricate patterns in noise even when none are there.

Just Becuase You can’t Explain it Doesn’t Make it Inexplicable

Summarizing the above vignettes, one might say:

In both of the examples above, the innovators went in to smash the dogma of the entrenched regime. They found a point upon which they thought the conventional wisdom was wrong, did some analysis to back it up, and declared victory. Over time, as better data and analytic techniques became available, it was shown that those initial conclusions were incorrect.

The real issue is two-fold. First, there’s the obvious conflation of a lack of conclusive evidence in favor of a hypothesis with evidence in favor of the opposite conclusion. (Neither of the original studies were Bayesian.) But the more interesting point to me is how easily folks slip into saying “It’s just noise” as a short-hand for “I can’t find evidence in the data for the specific hypothesis I tested”. Even worse when this is used to ridicule SMEs for holding onto sacred cows that aren’t easy to verify based on extant data. This is what happened repeatedly throughout the SABRmetrics revolution, first on internet forums, then on radio and television shows, and eventually on the big screen.

And I don’t make this point because I believe the metrics folks were wrong on the substance of the two matters above! Rather, I think they were basically correct, and the modern research which “refutes” them only does so in a limited way. Pitchers have limited control over the types of batted balls they give up, and variability from one season to the next for a single pitcher is greater than the variance between pitchers over the long run. Similarly, a 1 to 2 percentage point “hot hand” is pretty minuscule when compared with comments by players who get in grooves where they feel like they, “can’t miss”. By stating the conclusions too broadly, (“There is no hot hand!”, “Pitcher performance on balls in play is just luck!”) people end up looking silly in retrospect despite being mostly right!

The answer is for analysts and data scientists to be more circumspect in their . Don’t attribute to “noise” or “luck” what probably isn’t. In basketball, shot selection and defense are obvious factors that influence the likelihood of a shot going in, so attribute the high levels of variation to these or other unmeasured factors instead of saying a player was just “lucky”. Similarly, note that while pitchers may have some control over where they put a pitch, the batter gets a much bigger vote in the outcome.

The particularly frustrating thing is, often time original researchers will do this (see the Voros McCracken article on DIPS above, for example) but the folks that use their metrics don’t fully understand the caveat or choose to ignore them. It’s critical for data scientists to be able to communicate about their tools, and that means not stepping on your own feet by claiming that something is inexplicable when the SME sitting at the other end of the table has perfectly reasonable explanations at the ready.

When in doubt…

Briefly, I want to quote Andrew Gelman whose talked a lot about the “hot hand” on his blog. While written to the specific example of the “hot hand”, I think the conclusions are pretty portable:

The effects are certainly not zero. We are not machines, and anything that can affect our expectations (for example, our success in previous tries) should affect our performance.

The effects I’ve seen are small, on the order of 2 percentage points (for example, the probability of a success in some sports task might be 45% if you’re “hot” and 43% otherwise).

There’s a huge amount of variation, not just between but also among players. Sometimes if you succeed you will stay relaxed and focused, other times you can succeed and get overconfidence.

Whatever the latest results on particular sports, I can’t see anyone overturning the basic finding of Gilovich, Vallone, and Tversky that players and spectators alike will perceive the hot hand even when it does not exist and dramatically overestimate the magnitude and consistency of any hot-hand phenomenon that does exist.