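In this lab, you will:
- Scrape the 2025 Hall of Fame voting results from the web
- Clean the data to match the HallOfFame table from the Lahman package
- Save the combined data to a .csv file

All work should be done in this .Rmd file. Submit the .Rmd and your knitted .html (or .docx) to Canvas.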
Use the rvest package to read the table of Hall of Fame voting results for 2025 from:
https://www.baseball-reference.com/awards/hof_2025.shtml
Scrape the table and store it in a data frame called hall2025.
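The scraping pattern can be sketched with rvest against a small inline HTML table, so the example runs offline. The table contents and the names page and demo_tbl are made up for illustration. For the lab you would pass the URL above to read_html() and check which table on the page holds the voting results.

```r
library(rvest)

# Hypothetical stand-in for a downloaded page.
# In the lab, read_html("<the URL above>") takes the place of minimal_html().
page <- minimal_html('
  <table>
    <tr><th>Rk</th><th>Name</th><th>Votes</th><th>%vote</th></tr>
    <tr><td>1</td><td>Player A</td><td>300</td><td>75.0%</td></tr>
    <tr><td>2</td><td>Player B</td><td>120</td><td>30.0%</td></tr>
  </table>
')

# Select the first <table> node and convert it to a data frame (tibble).
demo_tbl <- page %>%
  html_element("table") %>%
  html_table()

str(demo_tbl)
```

If a page contains several tables, html_elements("table") returns all of them, and you can pick the one you need by position or by a CSS selector.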
# Your code here
Perform the following steps:
- If needed, extract column names from the first row
- Use parse_number() to convert %, vote counts, and ranks to numeric
- Remove any characters (like % or th) with gsub() or parse_number()
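The steps above can be sketched on a toy table. The data frame raw and all of its values are invented for illustration, and the real scraped table may be laid out differently, so treat this as a pattern rather than a solution.

```r
library(readr)

# Made-up raw table whose first row holds the real column names.
raw <- data.frame(
  X1 = c("Rk", "1st", "2nd"),
  X2 = c("Name", "Player A", "Player B"),
  X3 = c("Votes", "300", "120"),
  X4 = c("%vote", "75.0%", "30.0%"),
  stringsAsFactors = FALSE
)

# Promote the first row to column names, then drop that row.
names(raw) <- as.character(unlist(raw[1, ]))
demo <- raw[-1, ]

# parse_number() keeps the first number it finds and drops surrounding text.
demo$Rk    <- parse_number(demo$Rk)
demo$Votes <- parse_number(demo$Votes)

# Base R alternative: strip the unwanted character, then convert.
demo[["%vote"]] <- as.numeric(gsub("%", "", demo[["%vote"]], fixed = TRUE))

demo
```

Either route gives numeric columns. parse_number() is shorter, while gsub() makes the removed characters explicit.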
Create a cleaned data frame that contains the following 9 variables, matching the Lahman HallOfFame table:
- playerID: set to NA for now unless you can match player names manually
- yearID: set to 2025
- votedBy: set to "BBWAA"
- ballots: total number of ballots (should be the same for all rows)
- needed: number of votes needed for induction
- votes: number of votes the player received
- inducted: "Y" for inducted, "N" otherwise
- category: set to "Player"
- needed_note: set to NA
You can use head(HallOfFame) to inspect the structure.
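One way to assemble the nine columns is sketched below on made-up numbers. The objects parsed, total_ballots, and votes_needed are hypothetical, the ballot count of 400 is invented, and the 75% threshold is an assumption about how needed is derived. Check all of these against the scraped page and the existing HallOfFame rows.

```r
# Made-up parsed results; in the lab these come from your scraped table.
parsed <- data.frame(
  Name  = c("Player A", "Player B"),
  Votes = c(300, 120),
  stringsAsFactors = FALSE
)

total_ballots <- 400                           # invented value, illustration only
votes_needed  <- ceiling(0.75 * total_ballots) # assumes a 75% induction threshold

# Length-1 values are recycled to one row per player.
demo_clean <- data.frame(
  playerID    = NA_character_,
  yearID      = 2025L,
  votedBy     = "BBWAA",
  ballots     = total_ballots,
  needed      = votes_needed,
  votes       = parsed$Votes,
  inducted    = ifelse(parsed$Votes >= votes_needed, "Y", "N"),
  category    = "Player",
  needed_note = NA_character_,
  stringsAsFactors = FALSE
)

demo_clean
```

Using the same column names and order as HallOfFame keeps the later combining step simple.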
# Your cleaning code here
Combine with the HallOfFame data

Bind your cleaned table (hall2025_clean) to the HallOfFame data using bind_rows() or rbind(). Save the result to a new data frame called final_data.
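A small sketch of the binding step, using two toy data frames in place of HallOfFame and hall2025_clean. The names old_rows, new_rows, and demo_final and all values are made up. The factor column is included as a reminder to check column types before combining.

```r
library(dplyr)

# Tiny stand-ins for the existing table and the cleaned 2025 rows (values made up).
old_rows <- data.frame(
  playerID = "demo01",
  yearID   = 2024L,
  votes    = 250,
  inducted = factor("N", levels = c("Y", "N"))
)
new_rows <- data.frame(
  playerID = NA_character_,
  yearID   = 2025L,
  votes    = 300,
  inducted = "Y",
  stringsAsFactors = FALSE
)

# bind_rows() matches columns by name; a factor column combined with a
# character column comes out as character.
demo_final <- bind_rows(old_rows, new_rows)

str(demo_final)
```

rbind() also works when both data frames have the same column names, but it handles factor columns differently, so inspect the result with str() either way.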
# Your combining code here
Save your combined data frame to a file named HallOfFame.csv in your working directory.
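Writing and re-reading a CSV can be sketched as follows. The data frame demo_final is a made-up stand-in, and the example writes to a temporary path so it does not touch your working directory. For the lab the file name would be HallOfFame.csv.

```r
# Made-up stand-in for the combined data frame.
demo_final <- data.frame(
  playerID = c("demo01", NA),
  yearID   = c(2024L, 2025L),
  stringsAsFactors = FALSE
)

# Temporary path for illustration only.
out_path <- file.path(tempdir(), "HallOfFame_demo.csv")

# row.names = FALSE keeps R's row numbers out of the file.
write.csv(demo_final, out_path, row.names = FALSE)

# Read the file back to confirm the columns and row count survived.
check <- read.csv(out_path)
str(check)
```

readr::write_csv() is an alternative that never writes row names.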
# Your saving code here
Submit your .Rmd file and the knitted .html or .docx file to Canvas.

This homework builds on the lab by introducing data licensing and ethical considerations in web scraping.
Under what license is the R package ggplot2 published? What does that mean for use of its built-in diamonds dataset?
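One place to start is the license field recorded in the installed package itself. The sketch below uses utils::packageDescription(). The variable name lic is arbitrary, and the result reflects whichever ggplot2 version is installed on your machine, so it is worth confirming against the package's CRAN page.

```r
# Read the License field from the installed package's DESCRIPTION file.
# Returns NA (with a warning) if ggplot2 is not installed.
lic <- utils::packageDescription("ggplot2", fields = "License")
print(lic)
```

The field names the license but does not explain its terms; you still need to read the license text to answer the second part of the question.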
Find two different versions of a “diamonds” dataset on Kaggle. For each:
- Give the URL
- Identify the license
- Compare it to ggplot2::diamonds. Are they the same or different?

# Dataset 1:
# URL:
# License:
# Comparison code:
# Dataset 2:
# URL:
# License:
# Comparison code:
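For the comparison code, one possible approach is a helper that checks whether two data frames hold the same columns and rows regardless of order. The function frames_match() and the toy data below are hypothetical, not part of any package. It is a sketch of one reasonable definition of "same", and other definitions are defensible.

```r
# Hypothetical helper: TRUE when two data frames contain the same columns
# and the same rows, ignoring column order and row order.
frames_match <- function(a, b) {
  if (!identical(dim(a), dim(b))) return(FALSE)
  if (!setequal(names(a), names(b))) return(FALSE)

  cols <- sort(names(a))
  a <- as.data.frame(a)[, cols, drop = FALSE]
  b <- as.data.frame(b)[, cols, drop = FALSE]

  # Compare factors as text so that level order does not matter.
  to_text <- function(d) {
    d[] <- lapply(d, function(x) if (is.factor(x)) as.character(x) else x)
    d
  }
  a <- to_text(a)
  b <- to_text(b)

  # Sort rows by all columns, then drop row names before comparing.
  a <- a[do.call(order, unname(as.list(a))), , drop = FALSE]
  b <- b[do.call(order, unname(as.list(b))), , drop = FALSE]
  rownames(a) <- NULL
  rownames(b) <- NULL

  isTRUE(all.equal(a, b))
}

# Toy check with made-up rows: same data, shuffled rows and columns.
x <- data.frame(carat = c(0.23, 0.21), cut = c("Ideal", "Premium"))
y <- x[2:1, c("cut", "carat")]
z <- x
z$carat[1] <- 0.30

frames_match(x, y)
frames_match(x, z)
```

A downloaded copy may differ in small ways, such as an added index column, so a FALSE result is a prompt to look at what differs rather than a final answer.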
What license governs the Iowa Liquor Sales data? What does this allow you to do with the data?
Why do you think there are so many datasets on Kaggle that resemble one another without attribution?
If you’re feeling ambitious, try any of the following:
- Write a function data_link_scraper() that extracts dataset links from a Kaggle search (e.g. https://www.kaggle.com/search?q=diamonds)
- Write a function kaggle_evaluate() that takes a dataset link and extracts metadata like license, author, and file size.
- Write a function same_data() to test whether a Kaggle dataset is equivalent to ggplot2::diamonds.
Note: Your submission is supposed to be fully reproducible, i.e. the TA and I will ‘knit’ your submission in RStudio.
For the submission: submit your solution in an R Markdown file and (just for insurance) submit the corresponding html (or Word) file with it.
(Optional but encouraged):
If you’d like to practice using GitHub, feel free to push your .Rmd and knitted .html file to a public GitHub repository under your own account. If you do, paste the link to your GitHub repo below:
GitHub repo link (optional):
__________