Basketball is a multimillion dollar sports that fascinates millions of people all year round peaking in March Madness each year. Here, we are using play-by-play data on all NBA games on two days in Oct 2017.
Read the data from sample-combined-pbp-stats.csv
without downloading the file locally. (Hint: ?read.csv
)
Each line of the file describes one play in a game.
game_id
. How many games are recorded in the data and how
many plays (number of rows) does a game have on average?# write your code here
For each game, the variable points
keeps track of the
number of points attempted in each play.
Use functions from the dplyr
package to work out the
number of points each team scored (check result
). Filter
out all plays that are not associated with any of the teams.
Compare your scores with the final values of home and away scores
(check for event_type
and period
)
Using ggplot2
plot scores by team. Sort teams by
their median scores.
In a second step, use the scores to identify winning and losing
teams of each game (Hint: use dplyr again and remember what
which.max
and which.min
are doing).
player
keeps track of which player makes a shot. For
each game, identify which player in each team scored the most points.
(Hint: ?rank
)
# write your code here
The variables converted_x
and converted_y
give the location of the acting player on the court. Check with the
variable event_type
to see, for which types of play we have
geographic information.
For game number 21700003
plot the geographic location of
each play on the court, colour by team and incorporate visually the
play’s outcome (variable result
).
# write your code here
Is there any evidence that shots closer to the basket are successful
more often? For that, - introduce a new variable d
into the
data that captures the distance of a player from the basket (basket is
in [0,0] for variables original_x
and
original_y
). - draw side-by-side boxplots of distance by
result and team (using ggplot2). Interpret. (Hint: you can put either
result
or team
on the x-axis and the remaining
variable into facet)
# write your code here
The variable elapsed
is recorded in hour-minute-second
format. Convert the information into seconds (hint: you could introduce
helper variables for the conversion). The elapsed
time
variable starts over in each period (a period has 12 mins). Introduce a
new variable called time_played
that keeps track of the
time (in seconds) from the beginning of a game to the end.
Plot the timeline (time_played
) of events
(event_type
) for game with the id 21700002
in
a scatterplot. Colour by team. Make Facet by period
.
Comment on the result.
# write your code here
Variables h1
through h5
and a1
through a5
are the five players of the home team and the
away team on the field at that moment in positions 1 through 5. Reshape
the data set to combine all player variables. Extract position numbers.
For each game identify how many players were playing in each
position.
Draw side-by-side boxplots of the number of players by position. Comment.
# write your code here