DIY Metrics: Preparing a new data set

So I finally broke down and got a full season’s worth of NBA play-by-play data to work on. Going forward, I’ll be using the full 2017-2018 play-by-play data from NBAstuffer.

To date, I’ve been building my scripts using functional programming with the goal of having each step easily work with new data sets. This will be a good test of whether I’ve been successful!

But before we do that, we need to look at the new data set and see what, if anything, has changed. This means a new round of data cleaning!

Cleaning a new data set

The data set we’re dealing with is large enough that I don’t think my free spreadsheet software can handle it. So the first thing I do is try to read it into R and see what breaks:

library(tidyverse)
tbraw <- read_csv("data/[10-17-2017]-[06-08-2018]-combined-stats.csv")
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   period = col_double(),
##   away_score = col_double(),
##   home_score = col_double(),
##   remaining_time = col_time(format = ""),
##   elapsed = col_time(format = ""),
##   play_id = col_double(),
##   num = col_double(),
##   outof = col_double(),
##   points = col_double(),
##   shot_distance = col_double(),
##   original_x = col_double(),
##   original_y = col_double(),
##   converted_x = col_double(),
##   converted_y = col_double()
## )
## See spec(...) for full column specifications.
## Warning: 676 parsing failures.
##   row         col   expected     actual                                                file
## 11390 original_x  a double   unknown    'data/[10-17-2017]-[06-08-2018]-combined-stats.csv'
## 11390 original_y  a double   unknown    'data/[10-17-2017]-[06-08-2018]-combined-stats.csv'
## 11390 converted_x a double   unknown    'data/[10-17-2017]-[06-08-2018]-combined-stats.csv'
## 11390 converted_y a double   unknown    'data/[10-17-2017]-[06-08-2018]-combined-stats.csv'
## 34256 NA          44 columns 46 columns 'data/[10-17-2017]-[06-08-2018]-combined-stats.csv'
## ..... ........... .......... .......... ...................................................
## See problems(...) for more details.

Three things jump out at me: The first is that the data has remarkably few problems going in. The second is that the game_ids are flubbed (they come in as character strings), and I’m not sure why. We can fix that easily enough:

tbraw %>% 
  mutate(game_id = as.numeric(str_extract(game_id, "\\d+"))) %>% head
## # A tibble: 6 x 44
##   game_id data_set date  a1    a2    a3    a4    a5    h1    h2    h3   
##     <dbl> <chr>    <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1  2.17e7 2017-20~ 2017~ Gord~ Jays~ Al H~ Jayl~ Kyri~ LeBr~ Jae ~ Kevi~
## 2  2.17e7 2017-20~ 2017~ Gord~ Jays~ Al H~ Jayl~ Kyri~ LeBr~ Jae ~ Kevi~
## 3  2.17e7 2017-20~ 2017~ Gord~ Jays~ Al H~ Jayl~ Kyri~ LeBr~ Jae ~ Kevi~
## 4  2.17e7 2017-20~ 2017~ Gord~ Jays~ Al H~ Jayl~ Kyri~ LeBr~ Jae ~ Kevi~
## 5  2.17e7 2017-20~ 2017~ Gord~ Jays~ Al H~ Jayl~ Kyri~ LeBr~ Jae ~ Kevi~
## 6  2.17e7 2017-20~ 2017~ Gord~ Jays~ Al H~ Jayl~ Kyri~ LeBr~ Jae ~ Kevi~
## # ... with 33 more variables: h4 <chr>, h5 <chr>, period <dbl>,
## #   away_score <dbl>, home_score <dbl>, remaining_time <time>,
## #   elapsed <time>, play_length <chr>, play_id <dbl>, team <chr>,
## #   event_type <chr>, assist <chr>, away <chr>, home <chr>, block <chr>,
## #   entered <chr>, left <chr>, num <dbl>, opponent <chr>, outof <dbl>,
## #   player <chr>, points <dbl>, possession <chr>, reason <chr>,
## #   result <chr>, steal <chr>, type <chr>, shot_distance <dbl>,
## #   original_x <dbl>, original_y <dbl>, converted_x <dbl>,
## #   converted_y <dbl>, description <chr>

The third thing is all of the “problems”:

problems(tbraw)
## # A tibble: 676 x 5
##      row col        expected   actual   file                               
##    <int> <chr>      <chr>      <chr>    <chr>                              
##  1 11390 original_x a double   unknown  'data/[10-17-2017]-[06-08-2018]-co~
##  2 11390 original_y a double   unknown  'data/[10-17-2017]-[06-08-2018]-co~
##  3 11390 converted~ a double   unknown  'data/[10-17-2017]-[06-08-2018]-co~
##  4 11390 converted~ a double   unknown  'data/[10-17-2017]-[06-08-2018]-co~
##  5 34256 <NA>       44 columns 46 colu~ 'data/[10-17-2017]-[06-08-2018]-co~
##  6 34257 <NA>       44 columns 46 colu~ 'data/[10-17-2017]-[06-08-2018]-co~
##  7 34258 <NA>       44 columns 46 colu~ 'data/[10-17-2017]-[06-08-2018]-co~
##  8 34259 <NA>       44 columns 46 colu~ 'data/[10-17-2017]-[06-08-2018]-co~
##  9 34260 <NA>       44 columns 46 colu~ 'data/[10-17-2017]-[06-08-2018]-co~
## 10 34261 <NA>       44 columns 46 colu~ 'data/[10-17-2017]-[06-08-2018]-co~
## # ... with 666 more rows

Extra columns

Going back to point 1, 676 “problems” out of a total of nearly 600k rows is not very bad at all. But we still need to address them. It looks like there are a couple of categories of problems. The easiest to deal with might be the rows with two extra columns. These all appear to be consecutive, so I’m going to go out on a limb and guess that there’s a single game that, for whatever reason, was read in as having two extra columns. Let’s start by looking at that game:

tbcols <- tbraw %>% slice(
  # problems() reports row numbers that line up with the rows of tbraw
  (problems(tbraw) %>% filter(actual == "46 columns") %>% select(row) %>% unique)$row
  )

tbcols %>% select(game_id) %>% unique
## # A tibble: 1 x 1
##   game_id   
##   <chr>     
## 1 0021700074

Yeah, definitely just a single game. Visual inspection doesn’t indicate that anything’s missing, so we might be able to ignore these “problems” and move on, but let’s look at one more thing before we do so:

con <- file("data/[10-17-2017]-[06-08-2018]-combined-stats.csv", "r")
ab <- read.csv(con, header = F,
         skip = min((problems(tbraw) %>% filter(actual == "46 columns") %>% select(row) %>% unique)$row),
         nrow = nrow(tbcols)) %>% 
  as_tibble()
close(con)
ab
## # A tibble: 486 x 46
##        V1 V2    V3    V4    V5    V6    V7    V8    V9    V10   V11   V12  
##     <int> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct>
##  1 2.17e7 2017~ 10/2~ Paul~ Carm~ Stev~ Andr~ Russ~ Andr~ Taj ~ Karl~ Jimm~
##  2 2.17e7 2017~ 10/2~ Paul~ Carm~ Stev~ Andr~ Russ~ Andr~ Taj ~ Karl~ Jimm~
##  3 2.17e7 2017~ 10/2~ Paul~ Carm~ Stev~ Andr~ Russ~ Andr~ Taj ~ Karl~ Jimm~
##  4 2.17e7 2017~ 10/2~ Paul~ Carm~ Stev~ Andr~ Russ~ Andr~ Taj ~ Karl~ Jimm~
##  5 2.17e7 2017~ 10/2~ Paul~ Carm~ Stev~ Andr~ Russ~ Andr~ Taj ~ Karl~ Jimm~
##  6 2.17e7 2017~ 10/2~ Paul~ Carm~ Stev~ Andr~ Russ~ Andr~ Taj ~ Karl~ Jimm~
##  7 2.17e7 2017~ 10/2~ Paul~ Carm~ Stev~ Andr~ Russ~ Andr~ Taj ~ Karl~ Jimm~
##  8 2.17e7 2017~ 10/2~ Paul~ Carm~ Stev~ Andr~ Russ~ Andr~ Taj ~ Karl~ Jimm~
##  9 2.17e7 2017~ 10/2~ Paul~ Carm~ Stev~ Andr~ Russ~ Andr~ Taj ~ Karl~ Jimm~
## 10 2.17e7 2017~ 10/2~ Paul~ Carm~ Stev~ Andr~ Russ~ Andr~ Taj ~ Karl~ Jimm~
## # ... with 476 more rows, and 34 more variables: V13 <fct>, V14 <int>,
## #   V15 <int>, V16 <int>, V17 <fct>, V18 <fct>, V19 <fct>, V20 <int>,
## #   V21 <fct>, V22 <fct>, V23 <fct>, V24 <fct>, V25 <fct>, V26 <fct>,
## #   V27 <fct>, V28 <fct>, V29 <int>, V30 <fct>, V31 <int>, V32 <fct>,
## #   V33 <int>, V34 <fct>, V35 <fct>, V36 <fct>, V37 <fct>, V38 <fct>,
## #   V39 <int>, V40 <int>, V41 <int>, V42 <dbl>, V43 <dbl>, V44 <fct>,
## #   V45 <lgl>, V46 <lgl>

The code above reads in only the rows of our original .csv that we identified earlier as having two extra columns. I used read.csv on a connection here, though read_csv’s skip and n_max arguments should be able to grab the same slice; please correct me if I’m wrong about that.
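Here’s a sketch of what that read_csv version might look like (untested against this file; note that skip counts lines from the top of the file, so the header line offsets the data rows by one):

first_bad <- min((problems(tbraw) %>% 
                    filter(actual == "46 columns"))$row)
read_csv("data/[10-17-2017]-[06-08-2018]-combined-stats.csv",
         col_names = FALSE,    # readr will name the columns X1..X46
         skip = first_bad,     # skips the header plus the data rows before the problem rows
         n_max = nrow(tbcols))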

Now we can look directly at what’s in those columns:

ab %>% select(V45, V46)
## # A tibble: 486 x 2
##    V45   V46  
##    <lgl> <lgl>
##  1 NA    NA   
##  2 NA    NA   
##  3 NA    NA   
##  4 NA    NA   
##  5 NA    NA   
##  6 NA    NA   
##  7 NA    NA   
##  8 NA    NA   
##  9 NA    NA   
## 10 NA    NA   
## # ... with 476 more rows

As I suspected, they’re just empty columns that got included extraneously for some reason, probably an error introduced when the data set was being built. No big deal. I think this is the kind of thing that can happen when you’re working in Excel and you click in a cell beyond the scope of your data: Excel decides you “care” about those rows/columns even though you haven’t bothered to put anything into them, and when you convert to .csv, it includes those cells in the file as “empty” cells. The bottom line is we can ignore them.
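If we wanted to guard against this sort of thing in general, dplyr gives us a one-liner that drops any column consisting entirely of NAs (a sketch; we don’t actually need it here):

# drop phantom columns that contain nothing but NA
ab %>% select_if(~ !all(is.na(.)))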

At this point, I feel safe ignoring the parsing problems associated with this game, which greatly reduces our list!

Other problems

We’re left with 190 “problems”, each of which involves an “unknown” entry appearing where a double was expected:

pr <- problems(tbraw) %>% filter(actual != "46 columns")
pr
## # A tibble: 190 x 5
##       row col         expected actual  file                                
##     <int> <chr>       <chr>    <chr>   <chr>                               
##  1  11390 original_x  a double unknown 'data/[10-17-2017]-[06-08-2018]-com~
##  2  11390 original_y  a double unknown 'data/[10-17-2017]-[06-08-2018]-com~
##  3  11390 converted_x a double unknown 'data/[10-17-2017]-[06-08-2018]-com~
##  4  11390 converted_y a double unknown 'data/[10-17-2017]-[06-08-2018]-com~
##  5 392313 shot_dista~ a double unknown 'data/[10-17-2017]-[06-08-2018]-com~
##  6 392313 original_x  a double unknown 'data/[10-17-2017]-[06-08-2018]-com~
##  7 392313 original_y  a double unknown 'data/[10-17-2017]-[06-08-2018]-com~
##  8 392313 converted_x a double unknown 'data/[10-17-2017]-[06-08-2018]-com~
##  9 392313 converted_y a double unknown 'data/[10-17-2017]-[06-08-2018]-com~
## 10 421256 original_x  a double unknown 'data/[10-17-2017]-[06-08-2018]-com~
## # ... with 180 more rows

Conveniently, it looks like these problems occur in a few specific columns and only in a few specific games:

pr %>% select(col) %>% unique 
## # A tibble: 5 x 1
##   col          
##   <chr>        
## 1 original_x   
## 2 original_y   
## 3 converted_x  
## 4 converted_y  
## 5 shot_distance

ap <- tbraw %>% slice(
  (pr %>% select(row) %>% unique)$row
)
ap %>% mutate(game_id = as.numeric(str_extract(game_id, "\\d+"))) %>% select(game_id) %>% unique
## # A tibble: 3 x 1
##    game_id
##      <dbl>
## 1 21700025
## 2 21700861
## 3 21700923

All of these problems relate to shot location data, and they occur in only three games. When we look at these games individually, we see that these values are coded as “unknown” in the raw files. Those character strings are what produced the warning from read_csv, so that’s one mystery solved. This also suggests that there was an error with the cameras and/or software used to measure the shot locations for those specific shots. Since nothing we’ve done to date relies on shot location data, we can safely ignore these problems for now. If we ever want to do anything with shot quality or shot locations, it might be worth asking how these missing data points affect our evaluation, but for now, I’m fine ignoring them.
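As a quick sanity check (a sketch), readr stores any cell it couldn’t parse as NA, so the affected rows should simply be missing these values rather than carrying the string “unknown”:

# every cell that failed to parse should show up as NA in the tibble
tbraw %>%
  slice(unique(pr$row)) %>%
  summarise_at(vars(shot_distance, original_x:converted_y), ~ sum(is.na(.)))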

Data cleaning wrap up

So after all of that, it looks like read_csv did a fine job of parsing our data, and we don’t have to take any specific action to resolve the problems it flagged! Even though we didn’t end up changing anything, it’s good to know exactly what abnormalities exist in our data and when we should care about them. We may eventually need to consider the implications of the missing shot location data if we get to the point of incorporating shot quality into our DIY statistics, but for now, we’re safe to move on!

Finalizing a cleaned data file

The last thing we have to do is run the code we’d written to clean the original sample data on our full-season data set:

teamsandplayers <- tbraw %>% 
  filter(elapsed > 0) %>% # eliminates weird cases like the opening jump ball
  select(player, team) %>% 
  filter(!is.na(player)) %>% 
  distinct() 

# get_team(), score_changed(), and get_team_events() are the helper
# functions we built in earlier posts in this series
tb <- tbraw %>% 
  mutate(hometeam = get_team(h1, teamsandplayers),
         awayteam = get_team(a1, teamsandplayers),
         pointchange = map_lgl(points, score_changed)) 

allteams <- tb %>% 
  filter(elapsed > 0) %>% # eliminates weird cases like the opening jump ball
  select(team) %>% 
  filter(!is.na(team)) %>% 
  distinct() 
tmp <- left_join(allteams, get_team_events(allteams, tb))

write_rds(tmp, "clean-data/team-events-1718.rds")

And miracle of miracles, it looks like it’s worked! We save the cleaned data file and can now run our existing DIY stats code with the full expectation that it’ll work properly!
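For reference, later scripts can pick the cleaned file back up with read_rds:

team_events <- read_rds("clean-data/team-events-1718.rds")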