class: center, middle, inverse, title-slide

.title[
# DS 2020 - lab #4: Scraping (into) the Hall of Fame
]
.author[
### Will Ju
]

---
class: center, bottom, inverse
background-image: url(images/hof.jpg)

<!--<img src="" class="cover" height=1500>-->

# Lab #4: Scraping (into) the Hall of Fame

---

# Big Picture Goal

We are going to contribute an extension of the Hall of Fame data in the Lahman data package for the year 2025.

# Step-by-Step

In this activity we are going to

1. identify suitable websites with data on Baseball's Hall of Fame
2. write web scrapers that
    a. automate the download process,
    b. clean the data automatically as far as possible, or at least flag potential problems, and
    c. can be reused in future years
3. use the scraper to get data for the year 2025.
4. document the process.

The deliverables are (1) individual reports (Rmd) and (2) the dataset.

---

# Getting Ready

1. Identify your team! Go to Canvas and find out which team you are in for Lab 4.
2. Find the other members of your team and sit with them.
3. Introduce yourselves to each other.
4. Go to https://ds202-at-isu.github.io/labs/lab04.html and follow the instructions.

---

# Step-by-Step

1. Accept the link to GitHub Classroom shared in the announcement/chat.
    - This link will ask you to log in to GitHub. Select your name from the list by clicking on it.
    - Check if your team number already exists - if it does, join the team with the right number. If it doesn't exist yet, create it yourself.
2. Start a new RStudio project on your local machine using the link to the GitHub repository. (Connect your local R project to your GitHub repo)
3. For each individual, create a new RMarkdown file called `progress-report-<your github handle>.Rmd` (Mine would be called `progress-report-willju-wangqian.Rmd`). Delete everything from line 12 on.
4. Save the file and add it to your GitHub repository.
5. Use your created RMarkdown file for your lab notes - every individual should try to scrape the dataset. Keep track of what you are doing in your own Rmd file, so that afterward it is easier to see what worked and what didn't.
6. Commit the file and push.

We are ready to roll!

---

# Data Background

The Lahman data package is based on [Sean Lahman](https://www.seanlahman.com/)'s Baseball [Database](https://www.seanlahman.com/baseball-archive/statistics/).

The `HallOfFame` table is part of this set of tables and has data up to 2024. We tried scraping the 2023 data during our lectures. Now we are aiming for the 2025 data.

## Baseball Reference

The site baseball-reference.com has grown out of Sean Lahman's work and is now maintained independently.

Incidentally, it also has tables with Hall of Fame information, e.g. for 2025: https://www.baseball-reference.com/awards/hof_2025.shtml

---

# Scrape the data

Use the `rvest` package to download and parse data tables for Hall of Fame voting records.

# Clean the data

What steps are necessary to get the scraped data into the same shape as the `HallOfFame` data table?

```r
library(Lahman)
head(HallOfFame, 3)
```

```
##    playerID yearID votedBy ballots needed votes inducted category needed_note
## 1 aaronha01   1982   BBWAA     415    312   406        Y   Player        <NA>
## 2 abbotji01   2005   BBWAA     516    387    13        N   Player        <NA>
## 3 abreubo01   2020   BBWAA     397    298    22        N   Player        <NA>
```

---

# Deliverable: data

As a team, create the Hall of Fame data for the year 2025.

Each of you may end up with your own dataset. Compare your results within the team and select one as your final result.

Save this final result by appending the new data frame(s) (your cleaned scraped data) to the existing `HallOfFame` data in the `Lahman` package.

```r
library(dplyr)    # provides the pipe %>%
library(ggplot2)

# HallOfFame comes from the Lahman package loaded earlier
HallOfFame %>% 
  ggplot(aes(x = yearID, fill = inducted)) +
  geom_bar() +
  xlim(c(1936, 2024))
```

---

# Submission

1. Push changes to your file `progress-report-<github handle>.Rmd` to the GitHub repo.
2. Save the expanded `HallOfFame` as a csv file `HallOfFame.csv`. Push `HallOfFame.csv` to the team's repository.

---

# Some Data Tidying tricks

```r
library(rvest)
url <- "https://www.baseball-reference.com/awards/hof_2025.shtml"
html <- read_html(url)
tables <- html_table(html)
```

You may find that a data set has no column names, and that the names are instead stored in the first row of records:

```r
head(tables[[1]], 3)
```

```
## # A tibble: 3 × 39
##   ``    ``           ``    ``    ``    ``    ``    ``    ``    ``    ``    ``   
##   <chr> <chr>        <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Rk    Name         YoB   Votes %vote HOFm  HOFs  Yrs   WAR   WAR7  JAWS  Jpos 
## 2 1     Ichiro Suzu… 1st   393   99.7% 235   44    19    60.0  43.7  51.8  56.0 
## 3 2     CC Sabathia  1st   342   86.8% 128   48    19    62.3  39.4  50.8  61.3 
## # ℹ 27 more variables: `Batting Stats` <chr>, `Batting Stats` <chr>,
## #   `Batting Stats` <chr>, `Batting Stats` <chr>, `Batting Stats` <chr>,
## #   `Batting Stats` <chr>, `Batting Stats` <chr>, `Batting Stats` <chr>,
## #   `Batting Stats` <chr>, `Batting Stats` <chr>, `Batting Stats` <chr>,
## #   `Batting Stats` <chr>, `Batting Stats` <chr>, `Pitching Stats` <chr>,
## #   `Pitching Stats` <chr>, `Pitching Stats` <chr>, `Pitching Stats` <chr>,
## #   `Pitching Stats` <chr>, `Pitching Stats` <chr>, `Pitching Stats` <chr>, …
```

---

# Variable Names in Line 1

<!--
Write the dataset into a temporary file, and read the data back in (using the command `read_csv`) and skipping the first line:
-->

Store the first row, which holds the actual column names, in a variable. Overwrite the column names with these values. Then delete the first row.

```r
data <- tables[[1]]
# The first row holds the real column names; extract it as a character vector
actual_col_names <- as.character(unlist(data[1, ]))
colnames(data) <- actual_col_names
# Drop the header row now that it has been promoted to column names
data <- data[-1, ]
head(data, 3)
```

---

# Check variable types

The code below is just an example. Make sure that all numeric variables are indeed numeric.

```r
data$Votes <- as.numeric(data$Votes)
```

---

# Functions you might need

**`parse_number`** from the `readr` package

```r
readr::parse_number(c("34%", "10th", "1.0"))
```

```
## [1] 34 10  1
```

**`gsub`** from R base:

Usage `gsub(pattern, replacement, x)`: replace all occurrences of `pattern` in vector `x` by the string `replacement`.

```r
x <- c("David Ortiz", "X-Barry Bonds", "X-Roger Clemens")
gsub("X-", "Oh no! ", x)
```

```
## [1] "David Ortiz"          "Oh no! Barry Bonds"   "Oh no! Roger Clemens"
```

```r
gsub("X-", "", x)
```

```
## [1] "David Ortiz"   "Barry Bonds"   "Roger Clemens"
```

---

# Combining Data sets

If two data frames have the same variable names, we can use the command **`rbind`** (row bind) to concatenate them.

```r
x1 <- data.frame(id=1:2, name=c("A", "B"))
x2 <- data.frame(id=3:4, name=c("C", "D"))
rbind(x1, x2)
```

```
##   id name
## 1  1    A
## 2  2    B
## 3  3    C
## 4  4    D
```

```r
dframe <- rbind(x1, x2)
```

Don't forget to save the result!

---

# Exporting csv files

**`write.csv`** writes a data frame into a comma-separated values file (extension csv):

```r
write.csv(dframe, file="some-file.csv", row.names = FALSE)
```

Make sure not to export the row names; otherwise each successive read & write of the file adds another column at the front.

**`write_csv`** from the `readr` package is not part of base R, but it is faster. Reading the file back with `readr::read_csv` also keeps special characters in column names, whereas base `read.csv` converts them into `.` by default.

```r
readr::write_csv(dframe, file="some-other-file.csv")
```

---

Due date: You have time until Monday, Apr 28 at 11:59 pm to submit the final RMarkdown file.

One team member: upload the team's repo link to Canvas (just to signal to the instructor that you are done)
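
---

# Sketch: from scraped table to `HallOfFame` shape

The tidying steps above (promote the first row to column names, convert types, strip the `X-` prefix) can be combined into one reusable function. The following is only a minimal sketch under stated assumptions, not the required solution: `clean_hof_table` and the `name` helper column are hypothetical, the toy input stands in for `tables[[1]]` (only the CC Sabathia values come from the output shown earlier; the second row is made up), the number of ballots is passed in by hand, and `needed` is assumed to be 75% of the ballots rounded up, which matches the three `HallOfFame` rows printed earlier. `playerID` is left as `NA` because matching names to IDs is a separate step.

```r
# Toy stand-in for tables[[1]]: no column names, header stored in row 1.
# The second data row is invented to show the "X-" prefix handling.
raw <- data.frame(
  V1 = c("Rk", "1", "2"),
  V2 = c("Name", "CC Sabathia", "X-Some Player"),
  V3 = c("Votes", "342", "40"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: reshape a scraped voting table into HallOfFame columns.
clean_hof_table <- function(raw, year, ballots) {
  # Promote the first row to column names, then drop it
  names(raw) <- as.character(unlist(raw[1, ]))
  raw <- raw[-1, ]

  votes  <- as.numeric(raw$Votes)
  needed <- ceiling(0.75 * ballots)   # assumption: 75% threshold, rounded up

  data.frame(
    playerID    = NA_character_,               # still to be looked up
    name        = gsub("^X-", "", raw$Name),   # helper column, not in HallOfFame
    yearID      = year,
    votedBy     = "BBWAA",                     # assumption: a BBWAA ballot table
    ballots     = ballots,
    needed      = needed,
    votes       = votes,
    inducted    = ifelse(votes >= needed, "Y", "N"),
    category    = "Player",
    needed_note = NA_character_,
    stringsAsFactors = FALSE
  )
}

# ballots = 394 is an illustrative value; check it against the source page
hof_2025 <- clean_hof_table(raw, year = 2025, ballots = 394)
hof_2025
```

Before calling `rbind` with `HallOfFame`, the helper column `name` would have to be dropped and the column types aligned with the existing table.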