March Madness

Basketball is a multimillion dollar sports that fascinates millions of people all year round peaking in March Madness each year. Here, we are using play-by-play data on all NBA games on two days in Oct 2017.

Question #1: First Overview

Read the data from sample-combined-pbp-stats.csv without downloading the file locally. Each line of the file describes one play in a game.

  • How many rows and columns does the data have? What are the variables called?
  • each individual game is assigned a unique identifier game_id. How many games are recorded in the data and how many plays (number of rows) does a game have on average?
# write your code here
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
pbp <- read.csv("https://raw.githubusercontent.com/DS202-at-ISU/DS202-at-ISU.github.io/master/exams/sample-combined-pbp-stats.csv")
dim(pbp)
## [1] 2386   44
length(unique(pbp$game_id))
## [1] 5
pbp %>% group_by(game_id) %>% tally() %>% summary()
##     game_id               n        
##  Min.   :21700001   Min.   :440.0  
##  1st Qu.:21700002   1st Qu.:457.0  
##  Median :21700003   Median :477.0  
##  Mean   :21700003   Mean   :477.2  
##  3rd Qu.:21700004   3rd Qu.:488.0  
##  Max.   :21700005   Max.   :524.0
# baseic fact-finding questions - make sure to actually write answers to each question, i.e. there are 49 variables and 2386 plays overall. ...

Question #2: Who won?

For each game, the variable points keeps track of the number of points attempted in each play.

  • Use functions from the dplyr package to work out the number of points each team scored (check result). Filter out all plays that are not associated with any of the teams.

  • Compare your scores with the final values of home and away scores (check for event_type and period)

  • Using ggplot2 plot scores by team. Sort teams by their median scores.

  • In a second step, use the scores to identify winning and losing teams of each game (Hint: use dplyr again and remember what which.max and which.min are doing).

  • player keeps track of which player makes a shot. For each game, identify which player in each team scored the most points.

scores <- pbp %>% group_by(game_id, team) %>% 
  filter(result=="made") %>%
  summarise(
    score = sum(points),
    .groups = 'drop'
  )

pbp %>% filter(event_type=="end of period", period==4) %>% select(home_score, away_score, game_id)
##   home_score away_score  game_id
## 1        102         99 21700001
## 2        121        122 21700002
## 3        102         90 21700003
## 4        140        131 21700004
## 5        116        109 21700005
scores %>%
  ggplot(aes(reorder(team, score, median), score)) + geom_point() + coord_flip()

winners <- scores %>% 
  mutate(.by = game_id,
         winner = team[which.max(score)], 
         loser = team[which.min(score)])

players <- pbp %>% group_by(game_id, team, player) %>% 
  filter(result=="made") %>%
  summarise(
    score = sum(points),
    .groups = 'drop'
  ) %>%
  group_by(game_id, team) %>%
  mutate(
    mvp = rank(-score)
  ) %>% 
  filter(mvp==1)

players
## # A tibble: 10 × 5
## # Groups:   game_id, team [10]
##     game_id team  player           score   mvp
##       <int> <chr> <chr>            <int> <dbl>
##  1 21700001 BOS   Jaylen Brown        25     1
##  2 21700001 CLE   LeBron James        29     1
##  3 21700002 GSW   Nick Young          23     1
##  4 21700002 HOU   James Harden        27     1
##  5 21700003 CHA   Kemba Walker        24     1
##  6 21700003 DET   Tobias Harris       27     1
##  7 21700004 BKN   D'Angelo Russell    30     1
##  8 21700004 IND   Victor Oladipo      22     1
##  9 21700005 MIA   Hassan Whiteside    26     1
## 10 21700005 ORL   Evan Fournier       23     1
# James LeBron has only the second highest score in these four games. 

Question #3 : On the court

The variables converted_x and converted_y give the location of the acting player on the court. Check with the variable event_type to see, for which types of play we have geographic information.

For game number 21700003 plot the geographic location of each play on the court, colour by team and incorporate visually the play’s outcome (variable result). Describe the result.

pbp %>% group_by(event_type, is.na(converted_x), is.na(converted_y)) %>% tally()
## # A tibble: 13 × 4
## # Groups:   event_type, is.na(converted_x) [13]
##    event_type      `is.na(converted_x)` `is.na(converted_y)`     n
##    <chr>           <lgl>                <lgl>                <int>
##  1 end of period   TRUE                 TRUE                    20
##  2 foul            TRUE                 TRUE                   221
##  3 free throw      TRUE                 TRUE                   241
##  4 jump ball       TRUE                 TRUE                     5
##  5 miss            FALSE                FALSE                  486
##  6 rebound         TRUE                 TRUE                   532
##  7 shot            FALSE                FALSE                  419
##  8 start of period TRUE                 TRUE                    20
##  9 sub             TRUE                 TRUE                   221
## 10 timeout         TRUE                 TRUE                    55
## 11 turnover        TRUE                 TRUE                   147
## 12 unknown         TRUE                 TRUE                     7
## 13 violation       TRUE                 TRUE                    12
# shots and mises have geographic information

# an alternative solution
pbp %>% group_by(event_type) %>% 
  summarise(
    x_missing = sum(is.na(converted_x)),
    y_missing = sum(is.na(converted_y))
  )
## # A tibble: 13 × 3
##    event_type      x_missing y_missing
##    <chr>               <int>     <int>
##  1 end of period          20        20
##  2 foul                  221       221
##  3 free throw            241       241
##  4 jump ball               5         5
##  5 miss                    0         0
##  6 rebound               532       532
##  7 shot                    0         0
##  8 start of period        20        20
##  9 sub                   221       221
## 10 timeout                55        55
## 11 turnover              147       147
## 12 unknown                 7         7
## 13 violation              12        12
pbp %>% filter(game_id==21700003) %>% 
  ggplot(aes(x = converted_x, y = converted_y, colour=team, shape=result)) + 
  geom_point() + coord_equal()
## Warning: Removed 271 rows containing missing values (`geom_point()`).

# We see a clear pattern in the shots made and attempted. There are two tight clusters of shots attempted at the baskets, which spread out as distance increases. At the 3 point lines we can see another set of attempted shots. 

Question #4: Distance hurts?

Is there any evidence that shots closer to the basket are successful more often? For that, - introduce a new variable d into the data that captures the distance of a player from the basket (basket is in [0,0] for variables original_x and original_y). - draw side-by-side boxplots of distance by result and team (using ggplot2). Interpret.

pbp$d <- with(pbp, sqrt(original_x^2+original_y^2))
pbp %>% filter(team!="") %>% 
  ggplot(aes(x = result, y = d)) +
  facet_grid(.~team) + geom_boxplot()
## Warning: Removed 1299 rows containing non-finite values (`stat_boxplot()`).

# there is a clear difference in the distance associated with shots made and shots missed. Distance does hurt. Shots that were made tend to be closer to the basket.

Question #5: Timeline

The variable elapsed is recorded in hour-minute-second format. Convert the information into seconds (hint: you could introduce helper variables for the conversion). The elapsed time variable starts over in each period (a period has 12 mins). Introduce a new variable called time_played that keeps track of the time (in seconds) from the beginning of a game to the end.

Plot the timeline (time_played) of events (event_type) for game with the id 21700002 in a scatterplot. Colour by team. Comment on the result.

library(lubridate)
pbp <- pbp %>% separate(elapsed, into=c("hour", "minute", "second"), sep=":") %>%
  mutate(
    elapsed = as.numeric(minute)*60+as.numeric(second),
    time_played = elapsed + 12*60*(period-1)
  )
pbp %>% 
  filter(game_id=="21700002") %>%
  ggplot(aes(x = time_played, y = event_type,  colour=team)) + 
  geom_point() +
  facet_grid(period~.)

# there should be clear changes  between home and away team showing teams' plays, but on this scale this is hard to see.

Question #6: Who is playing?

Variables h1 through h5 and a1 through a5 are the five players of the home team and the away team on the field at that moment in positions 1 through 5. Reshape the data set to combine all player variables. Extract position numbers. For each game identify how many players were playing in each position.

Draw side-by-side boxplots of the number of players by position. Comment.

players <- pbp %>% pivot_longer(a1:h5, names_to = "teamposition", values_to = "position_player") %>%
  mutate(
    position = parse_number(teamposition)
  ) %>%
  group_by(position, game_id) %>%
  summarise(num_players = length(unique(position_player)))
## `summarise()` has grouped output by 'position'. You can override using the
## `.groups` argument.
players %>% ggplot(aes(x = factor(position), y=num_players))  +
  geom_boxplot() 

# quite a few players are exchanged in each position throughout a game (probably about half are from each of the teams - ideally we should distinguish between home and away team)
# position #1 has the fewest players,
# followed by position #2. 
# the other positions see more changes.