Common Analysis Errors

Bad Data Science in the Wild

Today’s example comes from a Reddit post on the USMNT subreddit that shows the proportion of minutes played by US Men’s National Team (USMNT) players who participated in the January mini-camp the USMNT does every year. OP made the following plot: IDK what this actually means, but I sure know what people will think when they see it! Background: The context here is that fans are generally dissatisfied with the USMNT right now, and one of the reasons is that Gregg Berhalter (the USMNT coach) doesn’t call up the right players.

The common failure mode of statistics and economics

Both Economics and Statistics share a peculiar failure mode: Many critical results in both rely on “large sample”/“long run average” proofs. The Central Limit Theorem is fundamental to much of classical statistics, including most (if not all) of the fundamental approaches that people are exposed to in their first few courses. The Efficient Market Hypothesis underpins much of the economic theory on which Western economies are based. Both are powerful tools for explaining common phenomena and often make complex problems simpler to understand and model.

The Crusade Against P-values

So we’ll call that break a “summer hiatus”. But now we’re back, and coming recently from the Joint Statistical Meetings (2019) in Denver, I’ve got Thoughts. This year’s JSM was different for me, because I spent most of my time on recruitment, speaking with potential applicants during many of the sessions. As a result, I attended many fewer talks than I normally do. By happenstance, the topic of the p-value came up repeatedly in the talks I was able to attend.

Pie Charts: A Journey

As a newly-minted PhD Statistician, I was hired by a company that didn’t have a lot of native statistical expertise because they wanted to change that. As a result, I felt empowered to give lots of opinions on topics within my domain to anyone who happened to be in the room, including the head of the division. One of those opinions was that pie charts were the worst. I viewed pie charts as the scarlet letter of bad analysis: Having one in your analysis should get you shamed and shunned.

Nonlinearity

This is an update to my Analysis Philosophy page, which is still working towards completion. Nonlinearity is a commonly misunderstood problem when it comes to data analysis, mostly because our profession has once again managed to find a way to use a simple-sounding term in a way that’s counterintuitive to lay audiences. (See also Artificial Intelligence is Dumb.) When people think about nonlinear response variables, they think of functions that have nonlinear relationships.

DIY Metrics: Game Logs

Previously on DIY Metrics… Last time in the DIY Metrics series, we had reached the point where we could extract a host of individual metrics from our data set using a function we’d named add_simple_stat_indicators: add_simple_stat_indicators <- function(tb){ tb %>% mutate( gotblk = (description == "BLOCK"), gotstl = (description == "STEAL"), gotast = (description == "ASSIST"), gotreb = map_lgl(description, str_detect, "REBOUND"), tfoulu = map_lgl(description, str_detect, "T.FOUL"), tfoull = map_lgl(description, str_detect, "T.

Distributionality

This is an update to my Analysis Philosophy page, which is still working towards completion. I only get 1,750 hits on Google when I search for “Distributionality”, so maybe I should clarify what I mean, though I don’t think it’s anything profound. That data follow distributions is a tautology. When this doesn’t appear to be the case, it means we’ve failed to properly model the data generation function. The most typical failure mode is to assume that the distribution is simpler than it is.

Projected records and rankings aren't equivalent

Nate Duncan’s “Dunc’d On” is probably my favorite NBA podcast. He and frequent co-host Danny Leroux are analytical and comprehensive, covering the whole league. About every other week, they’ll go through every team in a conference (East or West) and talk about how each team is doing, where they’re projected to finish, etc. They call these episodes “15 in 60”, although they don’t always get to all 15 teams in the conference, and I don’t think they’ve ever done one of these in 60 minutes.