Pie Charts: A Journey

As a newly-minted PhD Statistician, I was hired by a company that didn’t have a lot of native statistical expertise because they wanted to change that. As a result, I felt empowered to give lots of opinions on topics within my domain to anyone who happened to be in the room, including the head of the division. One of those opinions was that pie charts were the worst. I viewed pie charts as the scarlet letter of bad analysis: Having one in your analysis should get you shamed and shunned.


This is an update to my Analysis Philosophy page, which is still working towards completion. Nonlinearity is a commonly misunderstood problem when it comes to data analysis, mostly because our profession has once again managed to find a way to use a simple-sounding term in a way that’s counterintuitive to lay audiences. (See also Artificial Intelligence is Dumb.) When people think about nonlinear response variables, they think of functions that have nonlinear relationships.

DIY Metrics: Game Logs

Previously on DIY Metrics… Last time in the DIY Metrics series, we had reached the point where we could extract a host of individual metrics from our data set using a function we’d named add_simple_stat_indicators: add_simple_stat_indicators <- function(tb){ tb %>% mutate( gotblk = (description == "BLOCK"), gotstl = (description == "STEAL"), gotast = (description == "ASSIST"), gotreb = map_lgl(description, str_detect, "REBOUND"), tfoulu = map_lgl(description, str_detect, "T.FOUL"), tfoull = map_lgl(description, str_detect, "T.

How Analytics Ruins Sports

With the recent success of the Rockets, people are trotting out that old saw about analytics nerds ruining sports. With the Houston Rockets specifically, the question is a combined referendum on the numbers-based approach of GM Daryl Morey and the foul-drawing proclivities of Houston’s two stars, James Harden and Chris Paul. Of course, the latter is linked with the former, since analytics shows us that drawing shooting fouls is extremely efficient offense.

ML Invents Metadata

Disclaimer: This post is at least tongue-half-way-in-cheek. I actually like the article I’m lampooning. A recent publication by academics and AI researchers titled “Data Sheets for Datasets” calls for the Machine Learning community to ensure that all of their datasets are accompanied by a “datasheet.” These datasheets would contain information about the dataset’s “motivation, composition, collection process, recommended uses, and so on.” The authors, Gebru, et al., would like you to include more data about your dataset.

DIY Metrics: Counting up simple statistics

Alrighty! This post got delayed a bit due to real life as well as some challenges with the data. But it’s also an exciting post because we’re finally on the road to generating player-level counting statistics! Simple Statistics This post is focused on simple counting stats or box score statistics that were basically the standard way to discuss NBA players until quite recently. So aggregate numbers of rebounds, assists, steals, etc.

DIY Metrics: Normalizing by Possession

As promised, today we’re going to talk about normalizing by possession instead of time on court. First, a bit of motivation. Different teams play at different paces. Some teams try to score a lot in transition, some teams try to slow the ball down and make sure they get good shots in the half-court. Part of this is related to a team’s defense and how quickly they get rebounds in the hands of players who can push the ball.

DIY Metrics: Why Do we Normalize

Until now, we’ve normalized our data by time. This means we’ve been reporting out stats on a “per X minutes” basis. Today, we’re going to unpack a little bit about why we normalize and why we might not always want to normalize by time in the context of the NBA. What is “normalizing”? Normalization is the act of putting different observations on a level playing field. (That’s not literally what Wikipedia says, but I think it’s a fair paraphrasing for our application.

DIY Metrics: Full-season five-man Plus/Minus

Previously on DIY Metrics, we did some remedial cleaning on the full 17-18 play-by-play data set. Today, we’re going to take that clean data, generate full-season five-man plus/minus numbers, and then do some plotting! Cleaning, again So, turns out there were a few bugs that I didn’t catch the first time we went through the cleaning process. This is fairly typical in my experience: You’ll go through your data cleaning and think everything is Gucci only to find once you start your analysis that there are irregularities or issues you hadn’t considered.

DIY Metrics: Preparing a new data set

So I finally broke down and got a full season’s worth of NBA play-by-play data to work on. Going forward, I’ll be using the full 2017-2018 play-by-play data from NBAstuffer. To date, I’ve been building my scripts using functional programming with the goal of having each step easily work with new data sets. This will be a good test of whether I’ve been successful! But before we do that, we need to look at the new data set and see what, if anything, has changed.