DIY Metrics: Net Ratings (ish)

March 09, 2019

basketball, statistics, metrics, sports, R, diymetrics

Last time on DIY Metrics, we calculated five-man-unit plus/minus ratings from scratch. If we want to use measures like this to compare performance for these groups of players, its important to consider how much game time we have for each unit. There’s a relevant discussion to be had about whether “number of posessions” or “elapsed time” is the best way to compare these groups, (IMO, it depends on what specific question you’re trying to answer with your metric) but today we’ll avoid that discussion and normalize over time because it’s easier.

Data Cleaning

Our first step is to build out our workflow to allow us to do more interesting things. Thus far, we’ve been starting from scratch each time. This isn’t typically how you’d work on a project. The workflow we’ll show today is better for longer term projects. We’ll start by building up a script that generates a “cleaned” version of our raw data set (from NBAstuffer).

Here’s our script for data cleaning:

library(tidyverse)

tbraw <- read_csv("data/sample-combined-pbp-stats.csv") 

teamsandplayers <- tbraw %>% 
  filter(elapsed > 0) %>% #eliminates weird cases like opening jump ball
  select(player, team) %>% 
  filter(!is.na(player)) %>% 
  distinct() 

tb <- tbraw %>% 
  mutate(hometeam = get_team(h1, teamsandplayers),
         awayteam = get_team(a1, teamsandplayers),
         pointchange = map_lgl(points, score_changed)) 


allteams <- tb %>% 
  filter(elapsed > 0) %>% #eliminates weird cases like opening jump ball
  select(team) %>% 
  filter(!is.na(team)) %>% 
  distinct() 

tmp <- left_join(allteams, get_team_events(allteams, tb))

Simple, right? Plus a lot of those function calls look familiar from previous installments. Well, except for one. The main work-horse here is get_team_events, which is worth looking at in more detail. (I’ve included all the helper functions I used at end of the post if you’re interested in them or want to try to run this code on your own.)

get_team_events <- function(whichteam, tb){
  
  out <- NULL
  for(i in seq_along(whichteam$team)){
    
    thatteam <- whichteam$team[i]
    
    teamtbl <- tb %>% 
      mutate(currentteam = thatteam) %>% 
      filter(hometeam == currentteam | awayteam == currentteam) %>% 
      fiveplayers()
    
    thisteamsplayers <- get_this_teams_players(teamtbl, thatteam)
    
    out <- teamtbl %>% 
      add_subsections(thisteamsplayers) %>% 
      mutate(netpoints = pmap_dbl(list(points, currentteam, team), get_net_points)) %>% 
      select(-team) %>% rename(team = currentteam) %>% 
      bind_rows(out)
    
  }
  out %>% group_by(team) %>% nest(.key = `team events`)
  
}

For each team in the data set, this function creates a specific table that contains all events that team participated in, and tranforms the outcomes (e.g., whether a basket is a “positive” or “negative” to the perspective of that team. If it does look kinda familiar, it’s an updated version of get_team_scores from last week. The key difference is the add_subsections subfunction, which divides the game into subsections with identical personnel (from the team of interest). This can be useful if we want to aggregate over these time periods.

The last step in this script is to save the cleaned data set for future retrieval and use:

write_rds(tmp, "clean-data/team-events.rds")

I save it as an rds file here to preserve the nesting behavior of the tbl_df we’ve just built. You can also just saves it as a .csv if you want the more general formatting.

Getting the plus/minus scores and times

With our nicely cleaned data set, it’s now pretty simple to get the information we want.

First, we read in our data:

library(lubridate)
library(tidyverse)

tmp <- read_rds("clean-data/team-events.rds")

Next, we’ll need a new function:

get_fiveman_plusminus <- function(tb, team){
  
  tb %>% 
    mutate(team = team) %>% 
    mutate(pl = hms(play_length)) %>% 
    group_by(p1, p2, p3, p4, p5) %>% 
    summarise(`Plus/Minus` = sum(netpoints, na.rm = T),
              unittime = sum(pl)) 
  
}

This function is designed to work with purrr::map2 on the nested data set we just built. To get the time values, we use the play_length column from our data set along with lubridate::hms, which takes the column and tells R how to process it properly. I found this worked so well that I want to pause to illustrate it quickly:

tmp %>% 
  filter(team == "BOS") %>% 
  select(`team events`) %>% 
  unnest %>% 
  select(play_length) %>% 
  mutate(pls = hms(play_length)) #  <-- critical line of code from lubridate

## # A tibble: 477 x 2
##    play_length pls         
##    <chr>       <S4: Period>
##  1 0:00:00     0S          
##  2 0:00:00     0S          
##  3 0:00:16     16S         
##  4 0:00:17     17S         
##  5 0:00:04     4S          
##  6 0:00:02     2S          
##  7 0:00:03     3S          
##  8 0:00:16     16S         
##  9 0:00:03     3S          
## 10 0:00:08     8S          
## # ... with 467 more rows

Boy, howdy that’s easy! And the new column will do intuitive things when, for example, you use functions like sum over it.

Now let’s build our plus-minus data and check it out!

pm <- tmp %>%
  mutate(`Five-man Plus/Minus` = map2(`team events`, team, get_fiveman_plusminus))

pm %>% filter(team == "BOS") %>% select(`Five-man Plus/Minus`) %>% unnest()

## # A tibble: 25 x 7
##    p1          p2        p3        p4        p5       `Plus/Minus` unittime
##    <chr>       <chr>     <chr>     <chr>     <chr>           <dbl>    <dbl>
##  1 Al Horford  Jaylen B~ Marcus S~ Terry Ro~ Kyrie I~            3      103
##  2 Al Horford  Jaylen B~ Marcus S~ Terry Ro~ Shane L~           -1       89
##  3 Al Horford  Kyrie Ir~ Semi Oje~ Marcus S~ Jaylen ~           -3      128
##  4 Al Horford  Terry Ro~ Kyrie Ir~ Semi Oje~ Marcus ~            0        0
##  5 Gordon Hay~ Jayson T~ Al Horfo~ Jaylen B~ Kyrie I~            3      315
##  6 Jaylen Bro~ Kyrie Ir~ Jayson T~ Al Horfo~ Marcus ~           -6      298
##  7 Jaylen Bro~ Kyrie Ir~ Marcus S~ Aron Bay~ Semi Oj~            0        0
##  8 Jaylen Bro~ Marcus S~ Aron Bay~ Semi Oje~ Terry R~           -4      117
##  9 Jaylen Bro~ Marcus S~ Terry Ro~ Kyrie Ir~ Aron Ba~            0       88
## 10 Jaylen Bro~ Terry Ro~ Kyrie Ir~ Aron Bay~ Jayson ~            2      129
## # ... with 15 more rows

This is basically where we were last time, except now we’ve got the total duration of game time each of these units played. This also has the values broken out by “game section” as we defined it above, so we’ll have to include a summarise step to aggregate these. Also note that “0 second” periods. This is due to a quirk with the data set, which codes each substitution individually. When two players check in simultaneously, there’s a “0 second duration” section when only one player has checked in. So basically we want to ignore those sections, and we do so with a filter(unittime > 0) step.

By taking those times, summing over we can normalize unit performance to a per-36-minute basis:

pm %>% filter(team == "BOS") %>% select(`Five-man Plus/Minus`) %>% unnest() %>% 
  group_by(p1, p2, p3, p4, p5) %>% 
  filter(unittime > 0) %>% 
  summarise(tpm = sum(`Plus/Minus`),
            ttime = sum(unittime)) %>% 
  mutate(normpm = tpm * 3600 / ttime)

## # A tibble: 21 x 8
## # Groups:   p1, p2, p3, p4 [20]
##    p1          p2         p3         p4        p5          tpm ttime normpm
##    <chr>       <chr>      <chr>      <chr>     <chr>     <dbl> <dbl>  <dbl>
##  1 Al Horford  Jaylen Br~ Marcus Sm~ Terry Ro~ Kyrie Ir~     3   103  105. 
##  2 Al Horford  Jaylen Br~ Marcus Sm~ Terry Ro~ Shane La~    -1    89  -40.4
##  3 Al Horford  Kyrie Irv~ Semi Ojel~ Marcus S~ Jaylen B~    -3   128  -84.4
##  4 Gordon Hay~ Jayson Ta~ Al Horford Jaylen B~ Kyrie Ir~     3   315   34.3
##  5 Jaylen Bro~ Kyrie Irv~ Jayson Ta~ Al Horfo~ Marcus S~    -6   298  -72.5
##  6 Jaylen Bro~ Marcus Sm~ Aron Bayn~ Semi Oje~ Terry Ro~    -4   117 -123. 
##  7 Jaylen Bro~ Marcus Sm~ Terry Roz~ Kyrie Ir~ Aron Bay~     0    88    0  
##  8 Jaylen Bro~ Terry Roz~ Kyrie Irv~ Aron Bay~ Jayson T~     2   129   55.8
##  9 Jaylen Bro~ Terry Roz~ Kyrie Irv~ Jayson T~ Al Horfo~     0    13    0  
## 10 Jayson Tat~ Al Horford Aron Bayn~ Terry Ro~ Kyrie Ir~    -4    79 -182. 
## # ... with 11 more rows

So there you have it! These numbers look wonky becayse we’re still just looking at one or two game’s worth of data for each team. Suppose I’ll have to break down and pay for the full season’s worth of data at some point so this looks more reasonable….

But good enough for now!

Helper functions (data cleaning)

These are the helper functions used in the get_team_events function which does the main data cleaning from above:

get_team <- function(playernames, teamsandplayers){
  
  teamsandplayers$team[match(playernames, teamsandplayers$player)]
  
}

score_changed <- function(points){
  
  !is.na(points) && points > 0
}

get_net_points <- function(points, histeam, team){
  if(is.na(team))
    return(NA) else
      if(histeam == team)
        return(points) else
          return(points * -1)
}


add_subsections <- function(teamtbl, thisteamsplayers){
  
  subsec <- rep(0, nrow(teamtbl))
  
  counter <- 1
  for(i in 1:nrow(teamtbl)){
    if(!is.na(teamtbl$entered[i]) && teamtbl$entered[i] %in% thisteamsplayers$player) counter <- counter + 1
    subsec[i] <- counter
  }
  
  teamtbl$subsec <- subsec
  teamtbl
  
}


get_this_teams_players <- function(teamtbl, thatteam){
  
  teamtbl %>% 
    filter(elapsed > 0,             #eliminates weird cases like opening jump ball
           team == thatteam) %>% 
    select(player, team) %>% 
    filter(!is.na(player)) %>% 
    distinct()
}

fiveplayers <- function(tb){
  
  tb1 <- tb %>% filter(currentteam == hometeam) %>% 
    rename(p1 = h1, p2 = h2, p3 = h3, p4 = h4, p5 = h5, o1 = a1, o2 = a2, o3 = a3, o4 = a4, o5 = a5)
  
  tb2 <- tb %>% filter(currentteam == awayteam) %>% 
    rename(p1 = a1, p2 = a2, p3 = a3, p4 = a4, p5 = a5, o1 = h1, o2 = h2, o3 = h3, o4 = h4, o5 = h5)
  
  bind_rows(tb1, tb2)
  
}