DS 2020 - lab #4: Scraping (into) the Hall of Fame

class: center, middle, inverse, title-slide

.title[
# DS 2020 - lab #4: Scraping (into) the Hall of Fame
]
.author[
### Will Ju
]

---

class: center, bottom, inverse
background-image: url(images/hof.jpg)

# Lab #4: Scraping (into) the Hall of Fame
---

# Big Picture Goal

We are going to contribute an extension of the Hall of Fame data in the Lahman data package for the year 2025.

# Step-by-Step

In this activity we are going to

1. identify suitable websites with data on Baseball's Hall of Fame
2. write web scrapers with the goal of
    
    a. automating the download process,
    b. cleaning the data automatically as much as possible, or at least identifying potential problems, and finally
    c. reusing the scrapers in future years
    
3. use the scraper to get data for the year 2025.
4. document the process.

The deliverables are (1) individual reports (Rmd) and (2) the dataset.

---

# Getting Ready

1. Identify your team! Go to Canvas and find out which team you are in for Lab 4.

2. Find the other members of your team and sit with them.

3. Introduce yourself to each other.

4. Go to https://ds202-at-isu.github.io/labs/lab04.html and follow the instructions.

---

# Step-by-Step

1. Accept the link to GitHub Classroom shared in the announcement/chat. 
  
  - This link will ask you to log in to GitHub. Select your name from the list by clicking on it. 
  
  - Check whether your team number already exists. If it does, join the team with the right number; if it doesn't, create it yourself.

2. Start a new RStudio project on your local machine using the link to the GitHub repository, so that your local R project is connected to your GitHub repo.

3. Each team member creates a new R Markdown file called `progress-report-<your github handle>.Rmd` (mine would be called `progress-report-willju-wangqian.Rmd`). Delete everything from line 12 on.

4. Save the file and add it to your GitHub repository.

5. Use your R Markdown file for your lab notes. Every individual should try to scrape the dataset. Keep track of what you are doing in your own Rmd file, so that afterward it is easier to see what worked and what didn't.

6. Commit the file and push. We are ready to roll!

---

# Data Background

The Lahman data package is based on [Sean Lahman](https://www.seanlahman.com/)'s Baseball [Database](https://www.seanlahman.com/baseball-archive/statistics/).

The `HallOfFame` table is a part of this set of tables and has data up to 2024. We have tried to scrape the 2023 data during our lectures. Now, we are aiming for the 2025 data.

## Baseball Reference

The site baseball-reference.com has grown out of Sean Lahman's work and is now maintained independently.

Incidentally, it also has tables with Hall of Fame information, e.g. for 2025:

https://www.baseball-reference.com/awards/hof_2025.shtml
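
Since the scraper is meant to be reused in future years, it may help to build the address from the year instead of hard-coding it. The following is a minimal sketch, assuming that other years follow the same `hof_<year>.shtml` pattern as the 2025 address; `hof_url()` is a hypothetical helper name, not a function from any package.

```r
# Hypothetical helper: build the Baseball Reference Hall of Fame URL for a year.
# Assumption: other years follow the same pattern as the 2025 address.
hof_url <- function(year) {
  stopifnot(is.numeric(year), length(year) == 1)
  sprintf("https://www.baseball-reference.com/awards/hof_%d.shtml", as.integer(year))
}

hof_url(2025)
```

The returned string can be passed to `rvest::read_html()` in place of a fixed URL.
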

---
# Scrape the data

Use the `rvest` package to download and parse data tables for Hall of Fame voting records.

# Clean the data

What steps are necessary to get the scraped data into the same shape as the `HallOfFame` data table?

```r
library(Lahman)
head(HallOfFame, 3)
```

```
##    playerID yearID votedBy ballots needed votes inducted category needed_note
## 1 aaronha01   1982   BBWAA     415    312   406        Y   Player        <NA>
## 2 abbotji01   2005   BBWAA     516    387    13        N   Player        <NA>
## 3 abreubo01   2020   BBWAA     397    298    22        N   Player        <NA>
```
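
One possible way to reach that shape is sketched below on two rows copied from the 2025 voting table preview in these slides. Several things here are assumptions rather than part of the lab: that the number of ballots can be recovered from votes and vote share, that 75% of the ballots (rounded up) are needed, that these rows belong to the BBWAA vote, and that `playerID` is filled in later (for example by matching names against Lahman's `People` table). The column types are also simplified.

```r
# Two rows copied from the 2025 voting table preview in these slides
scraped <- data.frame(
  Name  = c("Ichiro Suzuki", "CC Sabathia"),
  Votes = c(393, 342),
  pct   = c("99.7%", "86.8%"),   # the `%vote` column
  stringsAsFactors = FALSE
)

# Assumption: ballots cast can be recovered from votes and vote share,
# and 75% of the ballots (rounded up) are needed for induction.
share   <- as.numeric(sub("%", "", scraped$pct)) / 100
ballots <- round(scraped$Votes[1] / share[1])
needed  <- ceiling(0.75 * ballots)

hof_2025 <- data.frame(
  playerID    = NA_character_,  # placeholder: look up in Lahman's People table
  yearID      = 2025L,
  votedBy     = "BBWAA",        # assumption for this table
  ballots     = as.integer(ballots),
  needed      = as.integer(needed),
  votes       = as.integer(scraped$Votes),
  inducted    = ifelse(scraped$Votes >= needed, "Y", "N"),
  category    = "Player",
  needed_note = NA_character_,
  stringsAsFactors = FALSE
)
hof_2025
```
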
---

# Deliverable: data

As a team, create the Hall of Fame data for the year 2025. Each of you may end up with your own dataset. Compare your results within the team and select one as your final result.

Save this final result by appending the new data frame(s) (your cleaned, scraped data) to the existing `HallOfFame` data in the `Lahman` package.

```r
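library(Lahman)
library(ggplot2)

# number of Hall of Fame candidates per year, filled by induction status
ggplot(HallOfFame, aes(x = yearID, fill = inducted)) +
  geom_bar() +
  xlim(c(1936, 2024))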
```

![](lab04_files/figure-html/unnamed-chunk-2-1.png)

---

# Submission

1. Push changes to your file `progress-report-<github handle>.Rmd` to the GitHub repo.

2. Save the expanded `HallOfFame` as a CSV file `HallOfFame.csv`. Push `HallOfFame.csv` to the team's repository.

---

# Some Data Tidying tricks

```r
library(rvest)
url <- "https://www.baseball-reference.com/awards/hof_2025.shtml"
html <- read_html(url)
tables <- html_table(html)
```

You may find that a data set has no column names, and that the names are stored in the first row of records instead:

```r
head(tables[[1]], 3)
```

```
## # A tibble: 3 × 39
##   ``    ``           ``    ``    ``    ``    ``    ``    ``    ``    ``    ``   
##   <chr> <chr>        <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Rk    Name         YoB   Votes %vote HOFm  HOFs  Yrs   WAR   WAR7  JAWS  Jpos 
## 2 1     Ichiro Suzu… 1st   393   99.7% 235   44    19    60.0  43.7  51.8  56.0 
## 3 2     CC Sabathia  1st   342   86.8% 128   48    19    62.3  39.4  50.8  61.3 
## # ℹ 27 more variables: `Batting Stats` <chr>, `Batting Stats` <chr>,
## #   `Batting Stats` <chr>, `Batting Stats` <chr>, `Batting Stats` <chr>,
## #   `Batting Stats` <chr>, `Batting Stats` <chr>, `Batting Stats` <chr>,
## #   `Batting Stats` <chr>, `Batting Stats` <chr>, `Batting Stats` <chr>,
## #   `Batting Stats` <chr>, `Batting Stats` <chr>, `Pitching Stats` <chr>,
## #   `Pitching Stats` <chr>, `Pitching Stats` <chr>, `Pitching Stats` <chr>,
## #   `Pitching Stats` <chr>, `Pitching Stats` <chr>, `Pitching Stats` <chr>, …
```
---

# Variable Names in Line 1

Obtain the column names from the first line as a variable. Overwrite the column names with the actual column names. Then delete the first line.

```r
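data <- tables[[1]]
# the first row holds the actual column names; turn it into a character vector
actual_col_names <- as.character(unlist(data[1, ]))
colnames(data) <- actual_col_names
# drop the header row now that it has been used
data <- data[-1, ]
head(data, 3)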
```
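
When a table has grouped headers (such as batting and pitching blocks), promoting the first row can leave several columns with the same name. The sketch below shows one way to handle this on a toy data frame with made-up values, using base R's `make.unique()`; whether and where duplicates occur in the scraped table is something to check yourself.

```r
# Toy table: the header sits in row 1 and the name "G" appears twice
raw <- data.frame(
  V1 = c("Rk", "1", "2"),
  V2 = c("Name", "Player A", "Player B"),
  V3 = c("G", "10", "20"),
  V4 = c("G", "3", "4"),
  stringsAsFactors = FALSE
)

header <- as.character(unlist(raw[1, ]))
anyDuplicated(header) > 0          # TRUE if at least one name repeats

# make.unique() appends .1, .2, ... to repeated names
names(raw) <- make.unique(header)
raw <- raw[-1, ]
rownames(raw) <- NULL
raw
```
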

---

# Check variable types

The code below is just an example. Make sure that all numeric variables are indeed numeric.

```r
data$Votes <- as.numeric(data$Votes)
```
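
To check several columns at once, something along the following lines may be useful. It is a sketch on a toy data frame with made-up player names; the column names `Votes` and `pct` stand in for whatever columns your scraped table actually has.

```r
# Toy data: every column arrives as character after scraping
toy <- data.frame(
  Name  = c("Player A", "Player B"),
  Votes = c("393", "342"),
  pct   = c("99.7%", "86.8%"),
  stringsAsFactors = FALSE
)

# Convert plain numeric columns in one go
num_cols <- c("Votes")
toy[num_cols] <- lapply(toy[num_cols], as.numeric)

# Percentages need the "%" stripped before conversion
toy$pct <- as.numeric(sub("%", "", toy$pct, fixed = TRUE))

# Inspect the resulting types; a column that is still "character"
# but should be numeric needs another look
sapply(toy, class)

# Count NAs per column to spot values that failed to convert
colSums(is.na(toy))
```
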

---

# Functions you might need

**`parse_number`** from the `readr` package

```r
readr::parse_number(c("34%", "10th", "1.0"))
```

```
## [1] 34 10  1
```

**`gsub`** from R base:

Usage `gsub(pattern, replacement, x)`:  replace all occurrences of `pattern` in vector `x` by the string `replacement`.

```r
x <- c("David Ortiz", "X-Barry Bonds", "X-Roger Clemens")

gsub("X-", "Oh no! ", x)
```

```
## [1] "David Ortiz"          "Oh no! Barry Bonds"   "Oh no! Roger Clemens"
```

```r
gsub("X-", "", x)
```

```
## [1] "David Ortiz"   "Barry Bonds"   "Roger Clemens"
```
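
Because `pattern` is a regular expression, it can be anchored so that only a prefix at the start of a name is removed. The sketch below uses base R's `sub()` and `startsWith()` on the same example vector; keeping a flag for the marked names is an optional extra, not something the lab requires.

```r
x <- c("David Ortiz", "X-Barry Bonds", "X-Roger Clemens")

# Record which names carried the marker before removing it
had_marker <- startsWith(x, "X-")

# "^" anchors the pattern to the start of the string, so only a leading
# "X-" is removed; sub() replaces at most one match per element
clean <- sub("^X-", "", x)

data.frame(name = clean, had_marker = had_marker)
```
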
---

# Combining Data sets

If two data frames have the same variable names, we can use the command **`rbind`** (row bind) to concatenate them.

```r
x1 <- data.frame(id=1:2, name=c("A", "B"))
x2 <- data.frame(id=3:4, name=c("C", "D"))

rbind(x1, x2)
```

```
##   id name
## 1  1    A
## 2  2    B
## 3  3    C
## 4  4    D
```

```r
dframe <- rbind(x1, x2)
```

Don't forget to save the result!
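
Before binding, it can be worth checking that the two data frames really share the same variable names. A small sketch with toy data frames follows; the names `x1`, `x2`, and `combined` are only for illustration.

```r
x1 <- data.frame(id = 1:2, name = c("A", "B"))
x2 <- data.frame(name = c("C", "D"), id = 3:4)   # same columns, different order

# Columns present in one data frame but not the other (empty if they agree)
setdiff(names(x1), names(x2))
setdiff(names(x2), names(x1))

# rbind() on data frames matches columns by name, so a different order is fine
stopifnot(setequal(names(x1), names(x2)))
combined <- rbind(x1, x2)
combined
```

If `setdiff()` reports any names, rename or drop those columns first; otherwise `rbind()` stops with an error.
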

---

# Exporting csv files

**`write.csv`** writes a data frame into a comma-separated values file (extension csv):

```r
write.csv(dframe, file="some-file.csv", row.names = FALSE)
```

Make sure to not export the row names, otherwise each successive read & write of the file adds another column in the front.

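**`write_csv`** from the `readr` package is not part of base R, but it is faster and never writes row names. (Special characters in column names are turned into `.` when a file is read with base `read.csv`; `readr::read_csv` keeps the names as they are.)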

```r
readr::write_csv(dframe, file="some-other-file.csv")
```
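
The effect of `row.names` can be seen with a quick round trip. This sketch writes a toy data frame to temporary files (so nothing in your project folder is touched) and reads the files back with base R's `read.csv`.

```r
dframe <- data.frame(id = 1:4, name = c("A", "B", "C", "D"))

# Temporary files, removed automatically when the R session ends
with_rows    <- tempfile(fileext = ".csv")
without_rows <- tempfile(fileext = ".csv")

write.csv(dframe, file = with_rows)                        # default: row.names = TRUE
write.csv(dframe, file = without_rows, row.names = FALSE)

names(read.csv(with_rows))      # an extra first column holding the row names
names(read.csv(without_rows))   # only the original columns
```
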

---

Due date: You have until Monday, Apr 28 at 11:59 pm to submit the final R Markdown file.

One team member: upload the team's repo link to Canvas (just to signal to the instructor that you are done).