Basketball is a multimillion dollar sports that fascinates millions of people all year round peaking in March Madness each year. Here, we are using play-by-play data on all NBA games on two days in Oct 2017.
Read the data from sample-combined-pbp-stats.csv without downloading the file locally. Each line of the file describes one play in a game.
game_id
. How many games are recorded in the data and how
many plays (number of rows) does a game have on average?# write your code here
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
pbp <- read.csv("https://raw.githubusercontent.com/DS202-at-ISU/DS202-at-ISU.github.io/master/exams/sample-combined-pbp-stats.csv")
dim(pbp)
## [1] 2386 44
length(unique(pbp$game_id))
## [1] 5
pbp %>% group_by(game_id) %>% tally() %>% summary()
## game_id n
## Min. :21700001 Min. :440.0
## 1st Qu.:21700002 1st Qu.:457.0
## Median :21700003 Median :477.0
## Mean :21700003 Mean :477.2
## 3rd Qu.:21700004 3rd Qu.:488.0
## Max. :21700005 Max. :524.0
# baseic fact-finding questions - make sure to actually write answers to each question, i.e. there are 49 variables and 2386 plays overall. ...
For each game, the variable points
keeps track of the
number of points attempted in each play.
Use functions from the dplyr
package to work out the
number of points each team scored (check result
). Filter
out all plays that are not associated with any of the teams.
Compare your scores with the final values of home and away scores
(check for event_type
and period
)
Using ggplot2
plot scores by team. Sort teams by
their median scores.
In a second step, use the scores to identify winning and losing
teams of each game (Hint: use dplyr again and remember what
which.max
and which.min
are doing).
player
keeps track of which player makes a shot. For
each game, identify which player in each team scored the most
points.
scores <- pbp %>% group_by(game_id, team) %>%
filter(result=="made") %>%
summarise(
score = sum(points),
.groups = 'drop'
)
pbp %>% filter(event_type=="end of period", period==4) %>% select(home_score, away_score, game_id)
## home_score away_score game_id
## 1 102 99 21700001
## 2 121 122 21700002
## 3 102 90 21700003
## 4 140 131 21700004
## 5 116 109 21700005
scores %>%
ggplot(aes(reorder(team, score, median), score)) + geom_point() + coord_flip()
winners <- scores %>%
mutate(.by = game_id,
winner = team[which.max(score)],
loser = team[which.min(score)])
players <- pbp %>% group_by(game_id, team, player) %>%
filter(result=="made") %>%
summarise(
score = sum(points),
.groups = 'drop'
) %>%
group_by(game_id, team) %>%
mutate(
mvp = rank(-score)
) %>%
filter(mvp==1)
players
## # A tibble: 10 × 5
## # Groups: game_id, team [10]
## game_id team player score mvp
## <int> <chr> <chr> <int> <dbl>
## 1 21700001 BOS Jaylen Brown 25 1
## 2 21700001 CLE LeBron James 29 1
## 3 21700002 GSW Nick Young 23 1
## 4 21700002 HOU James Harden 27 1
## 5 21700003 CHA Kemba Walker 24 1
## 6 21700003 DET Tobias Harris 27 1
## 7 21700004 BKN D'Angelo Russell 30 1
## 8 21700004 IND Victor Oladipo 22 1
## 9 21700005 MIA Hassan Whiteside 26 1
## 10 21700005 ORL Evan Fournier 23 1
# James LeBron has only the second highest score in these four games.
The variables converted_x
and converted_y
give the location of the acting player on the court. Check with the
variable event_type
to see, for which types of play we have
geographic information.
For game number 21700003
plot the geographic location of
each play on the court, colour by team and incorporate visually the
play’s outcome (variable result
). Describe the result.
pbp %>% group_by(event_type, is.na(converted_x), is.na(converted_y)) %>% tally()
## # A tibble: 13 × 4
## # Groups: event_type, is.na(converted_x) [13]
## event_type `is.na(converted_x)` `is.na(converted_y)` n
## <chr> <lgl> <lgl> <int>
## 1 end of period TRUE TRUE 20
## 2 foul TRUE TRUE 221
## 3 free throw TRUE TRUE 241
## 4 jump ball TRUE TRUE 5
## 5 miss FALSE FALSE 486
## 6 rebound TRUE TRUE 532
## 7 shot FALSE FALSE 419
## 8 start of period TRUE TRUE 20
## 9 sub TRUE TRUE 221
## 10 timeout TRUE TRUE 55
## 11 turnover TRUE TRUE 147
## 12 unknown TRUE TRUE 7
## 13 violation TRUE TRUE 12
# shots and mises have geographic information
# an alternative solution
pbp %>% group_by(event_type) %>%
summarise(
x_missing = sum(is.na(converted_x)),
y_missing = sum(is.na(converted_y))
)
## # A tibble: 13 × 3
## event_type x_missing y_missing
## <chr> <int> <int>
## 1 end of period 20 20
## 2 foul 221 221
## 3 free throw 241 241
## 4 jump ball 5 5
## 5 miss 0 0
## 6 rebound 532 532
## 7 shot 0 0
## 8 start of period 20 20
## 9 sub 221 221
## 10 timeout 55 55
## 11 turnover 147 147
## 12 unknown 7 7
## 13 violation 12 12
pbp %>% filter(game_id==21700003) %>%
ggplot(aes(x = converted_x, y = converted_y, colour=team, shape=result)) +
geom_point() + coord_equal()
## Warning: Removed 271 rows containing missing values (`geom_point()`).
# We see a clear pattern in the shots made and attempted. There are two tight clusters of shots attempted at the baskets, which spread out as distance increases. At the 3 point lines we can see another set of attempted shots.
Is there any evidence that shots closer to the basket are successful
more often? For that, - introduce a new variable d
into the
data that captures the distance of a player from the basket (basket is
in [0,0] for variables original_x
and
original_y
). - draw side-by-side boxplots of distance by
result and team (using ggplot2). Interpret.
pbp$d <- with(pbp, sqrt(original_x^2+original_y^2))
pbp %>% filter(team!="") %>%
ggplot(aes(x = result, y = d)) +
facet_grid(.~team) + geom_boxplot()
## Warning: Removed 1299 rows containing non-finite values (`stat_boxplot()`).
# there is a clear difference in the distance associated with shots made and shots missed. Distance does hurt. Shots that were made tend to be closer to the basket.
The variable elapsed
is recorded in hour-minute-second
format. Convert the information into seconds (hint: you could introduce
helper variables for the conversion). The elapsed
time
variable starts over in each period (a period has 12 mins). Introduce a
new variable called time_played
that keeps track of the
time (in seconds) from the beginning of a game to the end.
Plot the timeline (time_played
) of events
(event_type
) for game with the id 21700002
in
a scatterplot. Colour by team. Comment on the result.
library(lubridate)
pbp <- pbp %>% separate(elapsed, into=c("hour", "minute", "second"), sep=":") %>%
mutate(
elapsed = as.numeric(minute)*60+as.numeric(second),
time_played = elapsed + 12*60*(period-1)
)
pbp %>%
filter(game_id=="21700002") %>%
ggplot(aes(x = time_played, y = event_type, colour=team)) +
geom_point() +
facet_grid(period~.)
# there should be clear changes between home and away team showing teams' plays, but on this scale this is hard to see.
Variables h1
through h5
and a1
through a5
are the five players of the home team and the
away team on the field at that moment in positions 1 through 5. Reshape
the data set to combine all player variables. Extract position numbers.
For each game identify how many players were playing in each
position.
Draw side-by-side boxplots of the number of players by position. Comment.
players <- pbp %>% pivot_longer(a1:h5, names_to = "teamposition", values_to = "position_player") %>%
mutate(
position = parse_number(teamposition)
) %>%
group_by(position, game_id) %>%
summarise(num_players = length(unique(position_player)))
## `summarise()` has grouped output by 'position'. You can override using the
## `.groups` argument.
players %>% ggplot(aes(x = factor(position), y=num_players)) +
geom_boxplot()
# quite a few players are exchanged in each position throughout a game (probably about half are from each of the teams - ideally we should distinguish between home and away team)
# position #1 has the fewest players,
# followed by position #2.
# the other positions see more changes.