DIY Metrics: Counting up simple statistics
Alrighty! This post got delayed a bit due to real life as well as some challenges with the data. But it’s also an exciting post because we’re finally on the road to generating player-level counting statistics!
Simple Statitistics
This post is focused on simple counting stats or box score statistics that were basically the standard way to discuss NBA players until quite recently. So aggregate numbers of rebounds, assists, steals, etc. We’ve dabbled a bit already with more “advanced” metrics like raw and pace-adjusted plus/minus, so this is in some ways going back to basics. I think it’s worth exploring, however, because many of the things we look at here are key for taking some of the more “advanced” looks that folks like to do.
Cleaning, again.
Our first challenge is that the play-by-play data we’re working with doesn’t treat each type of event equally. For some (e.g., a turnover), there’s a specific tag showing that a turnover occurred. And as far as I can tell, it’s pretty consistent! There’s a column for it and everything. This is what it looks like:
tmp <- read_csv("data/[10-17-2017]-[06-08-2018]-combined-stats.csv") %>%
slice(1:1000)
tmp %>%
select(player, event_type, description) %>%
print(n = 10)
## # A tibble: 1,000 x 3
## player event_type description
## <chr> <chr> <chr>
## 1 <NA> start of per~ <NA>
## 2 Kevin Love jump ball Jump Ball Love vs. Horford: Tip to Irving
## 3 Kyrie Irving shot Irving 10' Driving Floating Jump Shot (2 PTS~
## 4 Derrick Rose miss Horford BLOCK (1 BLK), MISS Rose 2' Layup
## 5 Al Horford rebound Horford REBOUND (Off:0 Def:1)
## 6 Gordon Hayw~ miss MISS Hayward 25' 3PT Jump Shot
## 7 Derrick Rose rebound Rose REBOUND (Off:0 Def:1)
## 8 Kevin Love miss MISS Love 16' Jump Shot
## 9 Jaylen Brown rebound Brown REBOUND (Off:0 Def:1)
## 10 Jayson Tatum miss MISS Tatum 3' Cutting Layup Shot, James BLOC~
## # ... with 990 more rows
The three columns highlighted above are generally our best bets for information about the play. When we were working with point values directly, there were some other columns of interest, but for anything that isn’t a change in point values, these seem to be where it’s at. As you can see, there’s information on whether there was a TO, whether someone got a rebound, block, etc. So what we need to do is extract that.
Our first step is generating indicator functions from these columns to tell us when an event occurred. For some types, this is easier than others. For example, we can see pretty clearly that the event_type
column gives us a ready-made indicator of when someone made a basket or missed a basket. Made baskets are coded as shot
and missed baskets are coded as miss
. Thus, we build columns for made and missed field goals:
tmp %>%
mutate(fgmade = event_type == "shot",
fgmiss = event_type == "miss") %>%
select(player, fgmade, fgmiss)
## # A tibble: 1,000 x 3
## player fgmade fgmiss
## <chr> <lgl> <lgl>
## 1 <NA> FALSE FALSE
## 2 Kevin Love FALSE FALSE
## 3 Kyrie Irving TRUE FALSE
## 4 Derrick Rose FALSE TRUE
## 5 Al Horford FALSE FALSE
## 6 Gordon Hayward FALSE TRUE
## 7 Derrick Rose FALSE FALSE
## 8 Kevin Love FALSE TRUE
## 9 Jaylen Brown FALSE FALSE
## 10 Jayson Tatum FALSE TRUE
## # ... with 990 more rows
Free throws and 3-point baskets are a bit more complicated because required information resides in two columns. Specifically, we need to dive into the description
column to dig out whether the field goal was a 3-point shot and whether the free throw was made or missed. (Disclaimer: As is often the case when analyzing data, I have no idea why the data generators chose to code their data in this way!)
So we do this in a couple of steps. For 3-pointers, we first identify whether each field goal was a 3-pointer. Then, made field goals that were also 3’s were counted as made 3s and likewise with missed field goals. Here, the purrr
and stringr
packages come in handy for extracting particular character strings from the the description
column.
tmp %>%
mutate(fgmade = event_type == "shot",
fgmiss = event_type == "miss",
shot3 = map_lgl(description, str_detect, "3PT"),
made3 = shot3 & fgmade,
miss3 = shot3 & fgmiss,
shotft = event_type == "free throw",
missathing = map_lgl(description, str_detect, "MISS"),
madeft = shotft & !missathing,
missft = shotft & missathing) %>%
select(player, made3, miss3, madeft, missft) %>%
slice(921:930)
## # A tibble: 10 x 5
## player made3 miss3 madeft missft
## <chr> <lgl> <lgl> <lgl> <lgl>
## 1 Eric Gordon FALSE TRUE FALSE FALSE
## 2 <NA> FALSE FALSE FALSE FALSE
## 3 Shaun Livingston FALSE FALSE FALSE FALSE
## 4 PJ Tucker FALSE FALSE TRUE FALSE
## 5 PJ Tucker FALSE FALSE TRUE FALSE
## 6 Stephen Curry FALSE FALSE FALSE FALSE
## 7 James Harden FALSE TRUE FALSE FALSE
## 8 <NA> FALSE FALSE FALSE FALSE
## 9 PJ Tucker FALSE FALSE FALSE FALSE
## 10 <NA> FALSE FALSE FALSE FALSE
We can use these columns for totaling up the number of field goals, free throws, and 3-pointers each player took and made each game!
Events with multiple players involved
The other tricky thing is dealing with events where multiple players were involved. For example:
tmp %>%
select(player, event_type, description) %>%
filter(event_type == "turnover" | event_type == "shot") %>%
slice(11:20)
## # A tibble: 10 x 3
## player event_type description
## <chr> <chr> <chr>
## 1 Dwyane Wade turnover Wade Traveling Turnover (P1.T2)
## 2 Jaylen Brown shot Brown 1' Layup (4 PTS)
## 3 Jaylen Brown turnover Brown Step Out of Bounds Turnover (P2.T2)
## 4 LeBron James shot James 22' Jump Shot (2 PTS)
## 5 Kyrie Irving shot Irving 16' Pullup Jump Shot (6 PTS)
## 6 Derrick Rose shot Rose 1' Driving Layup (4 PTS)
## 7 LeBron James shot James 1' Running Layup (4 PTS)
## 8 Jaylen Brown shot Brown 1' Layup (6 PTS) (Irving 2 AST)
## 9 Tristan Thompson shot Thompson Alley Oop Layup (2 PTS) (Smith 1 ~
## 10 JR Smith shot Smith 3PT Jump Shot (3 PTS) (James 3 AST)
For some turnovers, there’s also a steal. For some made baskets, there’s an assist. In these cases, our data set records an event that involves multiple players accumulating stats simultaneously. However, only one player (the player committing the turnover or making the basket respectively) is recorded in the player
column. The exploits of the other player are only referenced in the description
! So we’ll need to go into there to ferret out the data we want.
Doing this actually requires a lot of work! The basic steps we need to take are below:
- Figure out when an event we care about (for now, say a Steal) occurred
- Figure out if that event was caused by a player on the team we’re looking at
- If so, attribute that event to that player
- Build a new row in our data frame containing information on that event for easy processing later
For (1), we’ll exploit the consistent way in which these events are written in the description. For steals and blocks, the description
is written as “[player last name] STEAL”. (Block are done the same way.) For assists, it’s “([player last name] [integer] AST)”. These lend themselves well to some good ’ole regular expressions!
tmp %>%
mutate(gotit = map_chr(description, str_extract, paste("[:alpha:]+", "STEAL"))) %>%
filter(!is.na(gotit)) %>%
select(player, description, gotit)
## # A tibble: 29 x 3
## player description gotit
## <chr> <chr> <chr>
## 1 Derrick Ro~ Irving STEAL (1 STL), Rose Lost Ball Turnover ~ Irving STEAL
## 2 Aron Baynes Baynes Lost Ball Turnover (P1.T3), Shumpert ST~ Shumpert ST~
## 3 Kevin Love Irving STEAL (2 STL), Love Bad Pass Turnover (~ Irving STEAL
## 4 Kyrie Irvi~ Irving Bad Pass Turnover (P1.T7), Crowder STEA~ Crowder STE~
## 5 Jae Crowder Smart STEAL (1 STL), Crowder Lost Ball Turnove~ Smart STEAL
## 6 Derrick Ro~ Brown STEAL (1 STL), Rose Bad Pass Turnover (P~ Brown STEAL
## 7 Dwyane Wade Smart STEAL (2 STL), Wade Bad Pass Turnover (P~ Smart STEAL
## 8 Dwyane Wade Brown STEAL (2 STL), Wade Bad Pass Turnover (P~ Brown STEAL
## 9 Dwyane Wade Irving STEAL (3 STL), Wade Bad Pass Turnover (~ Irving STEAL
## 10 LeBron Jam~ Rozier STEAL (1 STL), James Bad Pass Turnover ~ Rozier STEAL
## # ... with 19 more rows
Using purrr::map_chr
and stringr::str_extract
, we’re able to pick out the relevant bit of information we want from these data points. Next, we add a little bit of processing to separate the player from the event and then use the player’s full name rather than just the last name. This latter step allows us to be consistent with the conventions for other events (like field goals, above).
evplayer <- "STEAL player"
tmp %>%
mutate(allplayers = paste(a1, a2, a3, a4, a5, sep = ","),
gotit = map_chr(description, str_extract, paste("[:alpha:]+", "STEAL")),
evlastname = map_chr(gotit, str_sub, end = -7),
regexname = paste("[:alpha:]+", evlastname),
!!evplayer := map2_chr(allplayers, regexname, str_extract)) %>%
filter(!is.na(gotit)) %>%
select(-allplayers, -gotit, -evlastname, -regexname) %>%
select(description, `STEAL player`)
## # A tibble: 29 x 2
## description `STEAL player`
## <chr> <chr>
## 1 Irving STEAL (1 STL), Rose Lost Ball Turnover (P1.T1) Kyrie Irving
## 2 Baynes Lost Ball Turnover (P1.T3), Shumpert STEAL (1 STL) <NA>
## 3 Irving STEAL (2 STL), Love Bad Pass Turnover (P1.T5) Kyrie Irving
## 4 Irving Bad Pass Turnover (P1.T7), Crowder STEAL (1 STL) <NA>
## 5 Smart STEAL (1 STL), Crowder Lost Ball Turnover (P1.T6) Marcus Smart
## 6 Brown STEAL (1 STL), Rose Bad Pass Turnover (P2.T7) Jaylen Brown
## 7 Smart STEAL (2 STL), Wade Bad Pass Turnover (P2.T8) Marcus Smart
## 8 Brown STEAL (2 STL), Wade Bad Pass Turnover (P3.T9) Jaylen Brown
## 9 Irving STEAL (3 STL), Wade Bad Pass Turnover (P4.T10) Kyrie Irving
## 10 Rozier STEAL (1 STL), James Bad Pass Turnover (P3.T12) Terry Rozier
## # ... with 19 more rows
The !!
in the code above is from rlang
and allows us to name columns in tidyverse
-style functions like mutate
. It doesn’t make much sense in the above chunk, but is super useful if we want to write a general function that will let us re-use the code above for multiple types of events (like blocks!):
get_event_and_player <- function(ab, event){
evplayer <- paste(event, "player")
ab %>%
mutate(allplayers = paste(a1, a2, a3, a4, a5, sep = ","),
gotit = map_chr(description, str_extract, paste("[:alpha:]+", event)),
evlastname = map_chr(gotit, str_sub, end = -7),
regexname = paste("[:alpha:]+", evlastname),
!!evplayer := map2_chr(allplayers, regexname, str_extract)) %>%
select(-allplayers, -gotit, -evlastname, -regexname)
}
Because assists are coded differently, I wrote a separate function for them:
get_ast_and_player <- function(ab){
ab %>%
mutate(allplayers = paste(a1, a2, a3, a4, a5, sep = ","),
gotast = map_chr(description, str_extract, paste("[:alpha:]+ [:digit:] AST")),
astlastname = map_chr(gotast, str_sub, end = -7),
regexname = paste("[:alpha:]+", astlastname),
`ASSIST player` = map2_chr(allplayers, regexname, str_extract)) %>%
select(-allplayers, -gotast, -astlastname, -regexname)
}
Each of these functions will now generate new columns describing events with multiple players involved:
tmp %>%
get_ast_and_player() %>%
select(description, `ASSIST player`) %>% slice(3)
## # A tibble: 1 x 2
## description `ASSIST player`
## <chr> <chr>
## 1 Irving 10' Driving Floating Jump Shot (2 PTS) (Horford 1~ Al Horford
The next step is to re-format these columns to be consistent with the ways in which other data are recorded. To do that, we’ll generate all of our new columns and use a gather
to re-format them with the event type in one column and the player name in another. We’ll name these description
and player
respectively, and then append these new rows to our original data frame. This new data frame will then have simple tags in description
and player
telling us what event occurred and who did it, which is what we already have for things like made field goals and turnovers!
The function below does the above in a single go:
get_ast_stl_blk <- function(tb){
tb %>%
get_event_and_player("BLOCK") %>%
get_event_and_player("STEAL") %>%
get_ast_and_player() %>%
mutate(dropit = pmap_lgl(list(`BLOCK player`, `STEAL player`, `ASSIST player`),
function(a, b, d) !(is.na(a) &&
is.na(b) &&
is.na(d)
)
)) %>% filter(dropit) %>%
select(`BLOCK player`, `STEAL player`, `ASSIST player`, game_id,
date, playoffs, mplay_id, possessioncount) %>%
gather(-game_id, -mplay_id, -possessioncount, -date,
-playoffs,
key = "description", value = "player") %>%
filter(!is.na(player)) %>%
mutate(description = map_chr(description, str_sub, end = -8)) %>%
bind_rows(tb)
}
We can run this function on our raw data to create new rows, and then throw that data table into our cleaning function as before.
Putting it all together
Finally, we build a function that will be added to our cleaning step that creates columns for each event we’re interested in. This is basically wrapping together all of the stuff we did above:
add_simple_stat_indicators <- function(tb){
tb %>%
mutate(
gotblk = (description == "BLOCK"),
gotstl = (description == "STEAL"),
gotast = (description == "ASSIST"),
gotreb = map_lgl(description, str_detect, "REBOUND"),
tfoulu = map_lgl(description, str_detect, "T.FOUL"),
tfoull = map_lgl(description, str_detect, "T.Foul"),
fgmade = event_type == "shot",
fgmiss = event_type == "miss",
shotft = event_type == "free throw",
foul = event_type == "foul",
turnover = event_type == "turnover",
shot3 = map_lgl(description, str_detect, "3PT"),
made3 = map2_lgl(shot3, fgmade, function(a, b) a && b),
miss3 = map2_lgl(shot3, fgmiss, function(a, b) a && b),
missathing = map_lgl(description, str_detect, "MISS"),
madeft = map2_lgl(shotft, !missathing, function(a, b) a && b),
missft = map2_lgl(shotft, missathing, function(a, b) a && b),
tfoul = map2_lgl(tfoulu, tfoull, function(a, b) a | b),
pfoul = map2_lgl(foul, !tfoul , function(a, b) a && b))
}
This function takes all of the things we covered above (plus a few others!) and wraps them into a convenient function that we can apply to our raw data set in the cleaning portion.
Wrapping up
And unfortunately, there’s where I’m going to leave things for now! Next time, we’ll take these functions, build game logs, and then from those game logs, compute our simple summary statistics.
My main take-away from today is that even curated data are always going to be messy. The original data set treated in-game “events” as their unit of measure, with each row referencing a single event. However, as we discovered, often times, single events involve multiple players, and we want information on all of the players involved! Therefore, we had to do some slashing and chopping of the data to get where we wanted to go. purrr
and stringr
were very useful for this, and we even got to use a bit of rlang
when we wanted to write a single, general function that we could apply to multiple types of events.
I think we also learned that sometimes even seemingly simple questions like, “How many assists did Player X get?” can require a fair bit of work before we can even start to answer them!