class: center, middle, inverse, title-slide

.title[
# Functions in R and your first scraper
]
.author[
### Will Ju
]

---
class: middle, inverse, center

# Scraping with CSS

---

# Scrape the data

Use the `rvest` package to download and parse data tables for Hall of Fame voting records.

```r
url <- "https://www.baseball-reference.com/awards/hof_2022.shtml"
library(rvest)
site <- read_html(url)
```

The command `html_element` lets us select elements with CSS selectors; see [W3Schools CSS selectors](https://www.w3schools.com/CSSref/css_selectors.php) or [CSS Diner](https://flukeout.github.io/).

Load the Baseball Reference website in Chrome, then use View > Developer > Inspect Elements. What id should we use?

---

# BBWAA Table

The table has id `hof_BBWAA`:

```r
site %>%
  html_element(css = "#hof_BBWAA") %>%
  html_table() %>%
  head(3)
```

```
## # A tibble: 3 × 39
##   ``    ``           ``    ``    ``    ``    ``    ``    ``    ``    ``    ``
##   <chr> <chr>        <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Rk    Name         YoB   Votes %vote HOFm  HOFs  Yrs   WAR   WAR7  JAWS  Jpos
## 2 1     David Ortiz  1st   307   77.9% 171   55    20    55.3  35.2  45.3  53.4
## 3 2     X-Barry Bon… 10th  260   66.0% 340   77    22    162.8 72.7  117.8 53.4
## # ℹ 27 more variables: `Batting Stats` <chr>, `Batting Stats` <chr>,
## #   `Batting Stats` <chr>, `Batting Stats` <chr>, `Batting Stats` <chr>,
## #   `Batting Stats` <chr>, `Batting Stats` <chr>, `Batting Stats` <chr>,
## #   `Batting Stats` <chr>, `Batting Stats` <chr>, `Batting Stats` <chr>,
## #   `Batting Stats` <chr>, `Batting Stats` <chr>, `Pitching Stats` <chr>,
## #   `Pitching Stats` <chr>, `Pitching Stats` <chr>, `Pitching Stats` <chr>,
## #   `Pitching Stats` <chr>, `Pitching Stats` <chr>, `Pitching Stats` <chr>, …
```

The real column names ended up in the first data row. Solve this problem by writing the table into a temporary file and reading it back in.

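---

# Aside: promoting the header row in memory

The stray header row can also be handled without a file. The sketch below is one possible alternative, not the approach these slides take: it promotes the first row to column names on a made-up table. The `raw` data frame, its values, and the helper name `promote_header` are hypothetical.

```r
# Hypothetical stand-in for a scraped table whose real header
# ended up as the first data row (so every column is character).
raw <- data.frame(
  X1 = c("Rk", "1", "2"),
  X2 = c("Name", "Player A", "Player B"),
  X3 = c("Votes", "120", "95"),
  stringsAsFactors = FALSE
)

promote_header <- function(df) {
  # The first row holds the real column names
  header <- unlist(df[1, ], use.names = FALSE)
  out <- df[-1, , drop = FALSE]
  # make.unique() guards against repeated names
  names(out) <- make.unique(header)
  # Re-guess column types now that the header text is gone
  out[] <- lapply(out, type.convert, as.is = TRUE)
  rownames(out) <- NULL
  out
}

clean <- promote_header(raw)
str(clean)
```

Note that `make.unique()` labels repeated names as `G`, `G.1`, which differs from the `G...13` style that `readr` produces.
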
---

# Write the table, read the table

```r
bbwaa <- site %>%
  html_element(css = "#hof_BBWAA") %>%
  html_table()

# Round trip through a file; skip = 1 drops the over-header row
readr::write_csv(bbwaa, "temp.csv")
bbwaa <- readr::read_csv("temp.csv", skip = 1, show_col_types = FALSE)
```

```
## New names:
## • `G` -> `G...13`
## • `H` -> `H...16`
## • `HR` -> `HR...17`
## • `BB` -> `BB...20`
## • `G` -> `G...31`
## • `H` -> `H...35`
## • `HR` -> `HR...36`
## • `BB` -> `BB...37`
```

```r
head(bbwaa)
```

```
## # A tibble: 6 × 39
##      Rk Name       YoB   Votes `%vote`  HOFm  HOFs   Yrs   WAR  WAR7  JAWS  Jpos
##   <dbl> <chr>      <chr> <dbl> <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1     1 David Ort… 1st     307 77.9%     171    55    20  55.3  35.2  45.3  53.4
## 2     2 X-Barry B… 10th    260 66.0%     340    77    22 163.   72.7 118.   53.4
## 3     3 X-Roger C… 10th    257 65.2%     332    73    24 139.   65.9 103.   61.5
## 4     4 Scott Rol… 5th     249 63.2%      99    40    17  70.1  43.6  56.9  56.3
## 5     5 X-Curt Sc… 10th    231 58.6%     171    46    20  79.5  48.6  64.1  61.5
## 6     6 Todd Helt… 4th     205 52.0%     175    59    17  61.8  46.6  54.2  53.4
## # ℹ 27 more variables: G...13 <dbl>, AB <dbl>, R <dbl>, H...16 <dbl>,
## #   HR...17 <dbl>, RBI <dbl>, SB <dbl>, BB...20 <dbl>, BA <dbl>, OBP <dbl>,
## #   SLG <dbl>, OPS <dbl>, `OPS+` <dbl>, W <dbl>, L <dbl>, ERA <dbl>,
## #   `ERA+` <dbl>, WHIP <dbl>, G...31 <dbl>, GS <dbl>, SV <dbl>, IP <dbl>,
## #   H...35 <dbl>, HR...36 <dbl>, BB...37 <dbl>, SO <dbl>, `Pos Summary` <chr>
```

---
class: middle, inverse, center

# Functions in R

---

# Functions in R

- We have been using functions a lot; now we want to write them ourselves!
- Idea: avoid repetitive coding (errors will creep in)
- Instead: extract the common core, wrap it in a function, make it reusable

---

# Structure of functions

- Name
- Input arguments
  - names,
  - default values
- Body
- Output values

---

# A first function

```r
mymean <- function(x) {
  return(sum(x) / length(x))
}
```

```r
mymean(1:15)
```

```
## [1] 8
```

```r
mymean(c(1:15, NA))
```

```
## [1] NA
```

---

# A first function (2)

```r
mymean <- function(x, na.rm = FALSE) {
  # optionally drop missing values before averaging
  if (na.rm) x <- na.omit(x)
  return(sum(x) / length(x))
}

mymean(1:15)
```

```
## [1] 8
```

```r
mymean(c(1:15, NA), na.rm = TRUE)
```

```
## [1] 8
```

---
class: inverse

# Your Turn: a scraper

The package `rvest` allows us to download data from the Baseball Reference website `url` using the following lines of code:

```r
library(rvest)
site <- read_html(url)
bbwaa <- site %>%
  html_element("#hof_BBWAA") %>%
  html_table()
```

Write a function that takes the url as its input argument, scrapes the data, and returns it.

Try out your function on the site https://www.baseball-reference.com/awards/hof_2021.shtml

---

# Your turn - solution

```r
library(rvest)

bbwaa_scraper <- function(url) {
  site <- read_html(url)
  bbwaa <- site %>%
    html_element("#hof_BBWAA") %>%
    html_table()
  bbwaa
}

bbwaa_scraper("https://www.baseball-reference.com/awards/hof_2021.shtml")
```

```
## # A tibble: 26 × 39
##    ``    ``          ``    ``    ``    ``    ``    ``    ``    ``    ``    ``
##    <chr> <chr>       <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
##  1 Rk    Name        YoB   Votes %vote HOFm  HOFs  Yrs   WAR   WAR7  JAWS  Jpos
##  2 1     Curt Schil… 9th   285   71.1% 171   46    20    79.5  48.6  64.1  61.5
##  3 2     Barry Bonds 9th   248   61.8% 340   77    22    162.8 72.7  117.8 53.4
##  4 3     Roger Clem… 9th   247   61.6% 332   73    24    139.2 65.9  102.6 61.5
##  5 4     Scott Role… 4th   212   52.9% 99    40    17    70.1  43.6  56.9  56.3
##  6 5     Omar Vizqu… 4th   197   49.1% 120   42    24    45.6  26.8  36.2  55.5
##  7 6     Billy Wagn… 6th   186   46.4% 107   24    16    27.7  19.8  23.7  32.5
##  8 7     Todd Helto… 3rd   180   44.9% 175   59    17    61.8  46.6  54.2  53.4
##  9 8     Gary Sheff… 7th   163   40.6% 158   61    22    60.5  38.0  49.3  56.7
## 10 9     Andruw Jon… 4th   136   33.9% 109   34    17    62.7  46.4  54.6  58.2
## # ℹ 16 more rows
## # ℹ 27 more variables: `Batting Stats` <chr>, `Batting Stats` <chr>,
## #   `Batting Stats` <chr>, `Batting Stats` <chr>, `Batting Stats` <chr>,
## #   `Batting Stats` <chr>, `Batting Stats` <chr>, `Batting Stats` <chr>,
## #   `Batting Stats` <chr>, `Batting Stats` <chr>, `Batting Stats` <chr>,
## #   `Batting Stats` <chr>, `Batting Stats` <chr>, `Pitching Stats` <chr>,
## #   `Pitching Stats` <chr>, `Pitching Stats` <chr>, `Pitching Stats` <chr>, …
```

---
class: inverse

# Your Turn: expanding

Expand your function with a parameter `element` that lets you download different pieces of a website.

Set `element` to "#hof_Veterans" and try out your function for the year 2022.

<!-- --- -->
<!-- class: inverse -->
<!-- # Your Turn: a helper function -->
<!-- Write a helper function `dots_to_spaces` that takes as input a vector of characters (text), and returns as output the same vector in which all occurrences of '.' are replaced and all double spaces are reduced to one. -->
<!-- ```{r, eval = FALSE} -->
<!-- dots_to_spaces <- function(x) { -->
<!-- # body of the function -->
<!-- # return cleaned up vector x -->
<!-- } -->
<!-- ``` -->

---
class: inverse, center, middle

# Always scrape data responsibly!
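
---

# Aside: pausing between requests

One common way to act on this is to space requests out when scraping several pages. The following is a minimal sketch under assumptions: the helper name `scrape_all`, the stand-in `fake_scraper`, and the 5-second default are all hypothetical choices, and a site's terms of use may ask for something different.

```r
# Hypothetical helper: run a one-page scraper over several URLs,
# pausing between requests so the server is not hit in a burst.
scrape_all <- function(urls, scraper, delay = 5) {
  results <- vector("list", length(urls))
  names(results) <- urls
  for (i in seq_along(urls)) {
    # wait before every request except the first
    if (i > 1) Sys.sleep(delay)
    # list() keeps the slot even if the scraper returns NULL
    results[i] <- list(scraper(urls[i]))
  }
  results
}

# Offline check with a stand-in scraper that makes no request
fake_scraper <- function(url) nchar(url)
pages <- scrape_all(c("page-a", "page-bb"), fake_scraper, delay = 0)
str(pages)
```

With the scraper from the solution slide, the call would look like `scrape_all(urls, bbwaa_scraper)`, where `urls` is a character vector of award pages.
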