class: center, middle, inverse, title-slide

.title[
# Functions in R and your first scraper
]

.author[
### Will Ju
]

---
class: middle, inverse, center

# Scraping with CSS

---
# Scrape the data

Use the `rvest` package to download and parse data tables for Hall of Fame voting records.

```r
url <- "https://www.baseball-reference.com/awards/hof_2025.shtml"
library(rvest)
site <- read_html(url)
```

The function `html_element()` allows us to select elements based on CSS selectors; see [W3Schools CSS selectors](https://www.w3schools.com/CSSref/css_selectors.php) or [CSS Diner](https://flukeout.github.io/).

Load the Baseball Reference website in Chrome. Then use View > Developer > Inspect Elements. Which id should we use?

---
# BBWAA Table

The table has the id `hof_BBWAA`:

```r
site %>% html_element(css="#hof_BBWAA") %>% html_table() %>% head(3)
```

```
## # A tibble: 3 × 39
## `` `` `` `` `` `` `` `` `` `` `` ``
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Rk Name YoB Votes %vote HOFm HOFs Yrs WAR WAR7 JAWS Jpos
## 2 1 Ichiro Suzu… 1st 393 99.7% 235 44 19 60.0 43.7 51.8 56.0
## 3 2 CC Sabathia 1st 342 86.8% 128 48 19 62.3 39.4 50.8 61.3
## # ℹ 27 more variables: `Batting Stats` <chr>, `Batting Stats` <chr>,
## #   `Batting Stats` <chr>, `Batting Stats` <chr>, `Batting Stats` <chr>,
## #   `Batting Stats` <chr>, `Batting Stats` <chr>, `Batting Stats` <chr>,
## #   `Batting Stats` <chr>, `Batting Stats` <chr>, `Batting Stats` <chr>,
## #   `Batting Stats` <chr>, `Batting Stats` <chr>, `Pitching Stats` <chr>,
## #   `Pitching Stats` <chr>, `Pitching Stats` <chr>, `Pitching Stats` <chr>,
## #   `Pitching Stats` <chr>, `Pitching Stats` <chr>, `Pitching Stats` <chr>, …
```

---
# Reset the column names from the first row

```r
bbwaa <- site %>% html_element(css="#hof_BBWAA") %>% html_table()
colnames(bbwaa) <- bbwaa[1,]
bbwaa <- bbwaa[-1,]
head(bbwaa)
```

```
## # A tibble: 6 × 39
## Rk Name YoB Votes `%vote` HOFm HOFs Yrs WAR WAR7 JAWS Jpos
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## <chr> <chr> <chr> <chr> <chr>
## 1 1 Ichiro Su… 1st 393 99.7% 235 44 19 60.0 43.7 51.8 56.0
## 2 2 CC Sabath… 1st 342 86.8% 128 48 19 62.3 39.4 50.8 61.3
## 3 3 Billy Wag… 10th 325 82.5% 107 24 16 27.7 19.8 23.7 31.6
## 4 4 Carlos Be… 3rd 277 70.3% 126 52 20 70.0 44.4 57.2 58.0
## 5 5 Andruw Jo… 8th 261 66.2% 109 34 17 62.7 46.4 54.6 58.0
## 6 6 Chase Utl… 2nd 157 39.8% 94 36 16 64.6 49.3 57.0 56.9
## # ℹ 27 more variables: G <chr>, AB <chr>, R <chr>, H <chr>, HR <chr>,
## #   RBI <chr>, SB <chr>, BB <chr>, BA <chr>, OBP <chr>, SLG <chr>, OPS <chr>,
## #   `OPS+` <chr>, W <chr>, L <chr>, ERA <chr>, `ERA+` <chr>, WHIP <chr>,
## #   G <chr>, GS <chr>, SV <chr>, IP <chr>, H <chr>, HR <chr>, BB <chr>,
## #   SO <chr>, `Pos Summary` <chr>
```

---
class: middle, inverse, center

# Functions in R

---
# Functions in R

- We have been using functions a lot; now we want to write them ourselves!
- Idea: avoid repetitive coding (errors will creep in)
- Instead: extract the common core, wrap it in a function, and make it reusable

---
# Structure of functions

- Name
- Input arguments
  - names,
  - default values
- Body
- Output values

---
# A first function

```r
mymean <- function(x) {
  return(sum(x)/length(x))
}
```

```r
mymean(1:15)
```

```
## [1] 8
```

```r
mymean(c(1:15, NA))
```

```
## [1] NA
```

---
# A first function (2)

```r
mymean <- function(x, na.rm = FALSE) {
  if (na.rm) x <- na.omit(x)
  return(sum(x)/length(x))
}
mymean(1:15)
```

```
## [1] 8
```

```r
mymean(c(1:15, NA), na.rm = TRUE)
```

```
## [1] 8
```

---
class: inverse
# Your Turn: a scraper

The package `rvest` allows us to download data from the Baseball Reference website `url` using the following lines of code:

```r
library(rvest)
site <- read_html(url)
bbwaa <- site %>% html_element("#hof_BBWAA") %>% html_table()
```

Write a function that takes the url as an input argument, scrapes the data, and returns it.

Try out your function on the site https://www.baseball-reference.com/awards/hof_2024.shtml

---
# Your turn - solution

```r
library(rvest)
bbwaa_scraper <- function(url) {
  site <- read_html(url)
  bbwaa <- site %>% html_element("#hof_BBWAA") %>% html_table()
  bbwaa
}
bbwaa_scraper("https://www.baseball-reference.com/awards/hof_2024.shtml")
```

```
## # A tibble: 27 × 39
## `` `` `` `` `` `` `` `` `` `` `` ``
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Rk Name YoB Votes %vote HOFm HOFs Yrs WAR WAR7 JAWS Jpos
## 2 1 Adrian Bel… 1st 366 95.1% 163 55 21 93.7 48.9 71.3 56.1
## 3 2 Todd Helton 6th 307 79.7% 175 59 17 61.8 46.6 54.2 53.5
## 4 3 Joe Mauer 1st 293 76.1% 92 41 15 55.6 39.1 47.4 44.3
## 5 4 Billy Wagn… 9th 284 73.8% 107 24 16 27.7 19.8 23.7 31.6
## 6 5 X-Gary She… 10th 246 63.9% 158 61 22 60.5 38.0 49.3 56.0
## 7 6 Andruw Jon… 7th 237 61.6% 109 34 17 62.7 46.4 54.6 58.0
## 8 7 Carlos Bel… 2nd 220 57.1% 126 52 20 70.0 44.4 57.2 58.0
## 9 8 Alex Rodri… 3rd 134 34.8% 390 77 22 117.4 64.3 90.8 55.4
## 10 9 Manny Rami… 8th 125 32.5% 226 69 19 69.3 39.9 54.6 53.5
## # ℹ 17 more rows
## # ℹ 27 more variables: `Batting Stats` <chr>, `Batting Stats` <chr>,
## #   `Batting Stats` <chr>, `Batting Stats` <chr>, `Batting Stats` <chr>,
## #   `Batting Stats` <chr>, `Batting Stats` <chr>, `Batting Stats` <chr>,
## #   `Batting Stats` <chr>, `Batting Stats` <chr>, `Batting Stats` <chr>,
## #   `Batting Stats` <chr>, `Batting Stats` <chr>, `Pitching Stats` <chr>,
## #   `Pitching Stats` <chr>, `Pitching Stats` <chr>, `Pitching Stats` <chr>,
## #   `Pitching Stats` <chr>, `Pitching Stats` <chr>, `Pitching Stats` <chr>, …
```

---
class: inverse
# Your Turn: expanding

Expand your function with a parameter `element` that enables you to download different pieces of a website. Set `element` to "#hof_Veterans" and try out your function for the year 2025.

<!-- --- -->

<!-- class: inverse -->

<!-- # Your Turn: a helper function -->

<!-- Write a helper function `dots_to_spaces` that takes as input a vector of characters (text), and returns as output the same vector in which all occurrences of '.' are replaced by spaces and all double spaces are reduced to one.
-->

<!-- ```{r, eval = FALSE} -->
<!-- dots_to_spaces <- function(x) { -->
<!--   # body of the function -->
<!--   # return cleaned up vector x -->
<!-- } -->
<!-- ``` -->

---
class: inverse, center, middle

# Always scrape data responsibly!
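---
# Your turn: expanding - a sketch

One way to approach the expanding exercise, as a sketch rather than the definitive solution: add `element` as a second argument with the BBWAA table as its default, so earlier calls keep working. The function name `hof_scraper` is our choice here.

```r
library(rvest)

# Scrape one table from a page, identified by a CSS selector.
# `element` defaults to the BBWAA table; pass a different selector
# (e.g. "#hof_Veterans") to grab another table from the same page.
hof_scraper <- function(url, element = "#hof_BBWAA") {
  site <- read_html(url)
  site %>% html_element(css = element) %>% html_table()
}
```

For the 2025 Veterans Committee table, this would be called as `hof_scraper("https://www.baseball-reference.com/awards/hof_2025.shtml", element = "#hof_Veterans")`.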