Statistics

Disclaimer: This post is at least tongue-half-way-in-cheek. I actually like the article I’m lampooning. A recent publication by academics and AI researchers titled “Datasheets for Datasets” calls for the Machine Learning community to ensure that all of their datasets are accompanied by a “datasheet.” These datasheets would contain information on the dataset’s “motivation, composition, collection process, recommended uses, and so on.” The authors, Gebru et al., would like you to include more data about your dataset.

Alrighty! This post got delayed a bit due to real life as well as some challenges with the data. But it’s also an exciting post because we’re finally on the road to generating player-level counting statistics! Simple Statistics This post is focused on simple counting stats or box score statistics that were basically the standard way to discuss NBA players until quite recently. So aggregate numbers of rebounds, assists, steals, etc.

As promised, today we’re going to talk about normalizing by possession instead of time on court. First, a bit of motivation. Different teams play at different paces. Some teams try to score a lot in transition, some teams try to slow the ball down and make sure they get good shots in the half-court. Part of this is related to a team’s defense and how quickly they get rebounds into the hands of players who can push the ball.

Until now, we’ve normalized our data by time. This means we’ve been reporting out stats on a “per X minutes” basis. Today, we’re going to unpack a little bit about why we normalize and why we might not always want to normalize by time in the context of the NBA. What is “normalizing”? Normalization is the act of putting different observations on a level playing field. (That’s not literally what Wikipedia says, but I think it’s a fair paraphrasing for our application.)
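
To make the “per X minutes” idea concrete, here is a minimal sketch in Python. It is not the code from the post itself: the function name `per_minutes`, the 36-minute default, and the two players’ numbers are all made up for illustration.

```python
def per_minutes(stat_total: float, minutes_played: float, per: float = 36.0) -> float:
    """Scale a raw counting stat to a 'per `per` minutes' rate."""
    if minutes_played <= 0:
        raise ValueError("minutes_played must be positive")
    return stat_total * per / minutes_played


# Hypothetical players: similar raw rebound totals, very different playing time.
starter = per_minutes(stat_total=300, minutes_played=2400)
reserve = per_minutes(stat_total=250, minutes_played=1000)

print(f"starter: {starter:.1f} rebounds per 36 minutes")
print(f"reserve: {reserve:.1f} rebounds per 36 minutes")
```

On raw totals the starter looks like the better rebounder. Once both players are put on the same 36-minute footing, the reserve comes out ahead, which is the “level playing field” idea in miniature.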

Previously on DIY Metrics, we did some remedial cleaning on the full 17-18 play-by-play data set. Today, we’re going to take that clean data, generate full-season five-man plus/minus numbers, and then do some plotting! Cleaning, again So, turns out there were a few bugs that I didn’t catch the first time we went through the cleaning process. This is fairly typical in my experience: You’ll go through your data cleaning and think everything is Gucci only to find once you start your analysis that there are irregularities or issues you hadn’t considered.

So I finally broke down and got a full season’s worth of NBA play-by-play data to work on. Going forward, I’ll be using the full 2017-2018 play-by-play data from NBAstuffer. To date, I’ve been building my scripts using functional programming with the goal of having each step easily work with new data sets. This will be a good test of whether I’ve been successful! But before we do that, we need to look at the new data set and see what, if anything, has changed.

This is an update to my Analysis Philosophy page, which is still working towards completion. I only get 1,750 hits on Google when I search for “Distributionality”, so maybe I should clarify what I mean, though I don’t think it’s anything profound. That data follow distributions is a tautology. When this doesn’t appear to be the case, it means we’ve failed to properly model the data generation function. The most typical failure mode is to assume that the distribution is simpler than it is.

Last time on DIY Metrics, we calculated five-man-unit plus/minus ratings from scratch. If we want to use measures like this to compare performance for these groups of players, it’s important to consider how much game time we have for each unit. There’s a relevant discussion to be had about whether “number of possessions” or “elapsed time” is the best way to compare these groups (IMO, it depends on what specific question you’re trying to answer with your metric), but today we’ll avoid that discussion and normalize over time because it’s easier.

I mean the term of art, not the concept or field of study. And what’s dumb is how it’s applied. “Machine learning” is also dumb in a similar way. Some definitions for AI You can go back to the beginning of the field if you want to, but modern definitions tend to be vague. There are good definitions out there, but these sound esoteric and unless you’re really interested in defining AI precisely, you’ll probably just stick with Merriam-Webster or Wikipedia, which means literally:

Last week, I described how to build a plus/minus score for individual players based on data from NBAstuffer. I enjoyed walking through that process, so let’s continue the series and expand our focus. Five-man units vs. Individual Players One of the first things I talked about on this site was comparing different metrics and choosing the right one for the task at hand. Plus/minus for individual players is a weird metric, because it’s taking a team outcome (net change in score) and applying it at an individual level.

ML Invents Metadata

DIY Metrics: Counting up simple statistics

DIY Metrics: Normalizing by Posession

DIY Metrics: Why Do we Normalize

DIY Metrics: Full-season five-man Plus/Minus

DIY Metrics: Preparing a new data set

Distributionality

DIY Metrics: Net Ratings (ish)

"Artificial Intelligence" is dumb

DIY Metrics: Five-Man Unit Plus/Minus