class: center, middle, inverse, title-slide

.title[
# Web Scraping
]
.author[
### DS 2020
]

---
## Web Scraping

- Transform data from web pages into usable information
- Automate the process

![](http://webdata-scraping.com/wp-content/uploads/2013/11/web-scraping-services.png)

---
## `rvest` + `xml2`: Easy Web Scraping

- `read_html` gets the full set of HTML markup from a URL

```r
library(rvest)
url <- "https://en.wikipedia.org/wiki/2023_Baseball_Hall_of_Fame_balloting"
html <- read_html(url)
html
```

```
## {html_document}
## <html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-enabled skin-theme-clientpref-day vector-toc-available" lang="en" dir="ltr">
## [1] <head>\n<meta http-equiv="Content-Type" content="text ...
## [2] <body class="skin--responsive skin-vector skin-vector ...
```

- Use `html_attr`, `html_element`, `html_table`, and `html_text` to extract useful information from the markup

---
## Get a *table* from an online source

`html_table` extracts all tables from the sourced HTML into a list of data frames:

```r
tables <- html %>% html_table(fill=TRUE)
# tables %>% str()
```

---
## Data Munging

Most tables need a bit of clean-up:

```r
bbwaa <- tables[[3]]  # candidates on the BBWAA Ballot
vet1 <- tables[[5]]   # Early Baseball Era Committee
vet2 <- tables[[6]]   # Golden Days Era Committee
bbwaa %>% head()
```

```
## # A tibble: 6 × 5
##   Player          Votes Percent Change Year
##   <chr>           <int> <chr>   <chr>  <chr>
## 1 Scott Rolen       297 76.3%   013.1% 6th
## 2 Todd Helton       281 72.2%   020.2% 5th
## 3 Billy Wagner      265 68.1%   017.1% 8th
## 4 Andruw Jones      226 58.1%   016.7% 6th
## 5 Gary Sheffield    214 55.0%   014.4% 9th
## 6 Carlos Beltrán†   181 46.5%   –      1st
```

---
class: inverse

# Your Turn

Go to the site https://bbwaa.com/23-hof/#votingtable

Read all tables from this website.

Which source should we use?

---
# Your Turn - Reading Data

```r
hof <- "https://bbwaa.com/23-hof/#votingtable"
html <- read_html(hof)
hof_tbl <- html %>% html_table()
bbwaa <- hof_tbl[[1]]
names(bbwaa)[1] <- "First Lastname"
head(bbwaa)
```

```
## # A tibble: 6 × 4
##   `First Lastname` Votes Percent `Years on ballot`
##   <chr>            <int>   <dbl>             <int>
## 1 Scott Rolen        297    76.3                 6
## 2 Todd Helton        281    72.2                 5
## 3 Billy Wagner       265    68.1                 8
## 4 Andruw Jones       226    58.1                 6
## 5 Gary Sheffield     214    55                   9
## 6 Carlos Beltran     181    46.5                 1
```

---
# Your Turn

The `HallOfFame` dataset in the Lahman package has slightly different variables, as shown below. How would you go about determining these variables for the `bbwaa` data?
```r
library(Lahman)
head(HallOfFame,2)
```

```
##   playerID yearID votedBy ballots needed votes inducted
## 1 cobbty01   1936   BBWAA     226    170   222        Y
## 2 ruthba01   1936   BBWAA     226    170   215        Y
##   category needed_note
## 1   Player        <NA>
## 2   Player        <NA>
```

---
# Your Turn - Creating new variables

From https://bbwaa.com/23-hof/ : 389 ballots cast in 2023, 292 needed for induction

```r
library(dplyr)  # mutate(), rename(), select(), and the joins used later
bbwaa <- bbwaa %>% mutate(
  yearID = 2023,
  votedBy = "BBWAA",
  ballots = 389,
  needed = 292,
  inducted = ifelse(Votes>=292, "Y", "N"),
  category = NA, # don't know yet
  needed_note = NA # not sure what would go here
) %>% rename(
  votes = Votes
) %>% select(-Percent, -`Years on ballot`)
```

---
# Data Munging

The `People` data frame has `playerID` and players' names.

We could try to create a (temporary) variable in `People` called `First Lastname` that consists of `nameFirst` and `nameLast`:

```r
People <- People %>% mutate(
  `First Lastname`=paste(`nameFirst`, `nameLast`)
)
```

---
class: inverse

## Your Turn

Use the expanded version of `People` to merge the playerID info into the `bbwaa` dataset.

How many playerIDs are we missing? Get the list of names that we cannot match.

Is there a likely reason for the mismatches, and can we work around it?

---
# Your Turn - Identifying Problems

```r
bbwaa %>% anti_join(
  People %>% select(`First Lastname`, playerID),
  by="First Lastname")
```

```
## # A tibble: 2 × 9
##   `First Lastname` votes yearID votedBy ballots needed
##   <chr>            <int>  <dbl> <chr>     <dbl>  <dbl>
## 1 R.A. Dickey          1   2023 BBWAA       389    292
## 2 J.J. Hardy           0   2023 BBWAA       389    292
## # ℹ 3 more variables: inducted <chr>, category <lgl>,
## #   needed_note <lgl>
```

```r
People %>% filter(nameLast %in% c("Dickey", "Hardy")) %>%
  select(playerID, nameFirst, nameLast)
```

```
##     playerID nameFirst nameLast
## 1  dickebi01      Bill   Dickey
## 2  dickege02    George   Dickey
## 3  dickera01     R. A.   Dickey
## 4  hardyal01      Alex    Hardy
## 5  hardybl01    Blaine    Hardy
## 6  hardyca01   Carroll    Hardy
## 7  hardyha01     Harry    Hardy
## 8  hardyja01      Jack    Hardy
## 9  hardyja02      Jack    Hardy
## 10 hardyjj01     J. J.    Hardy
## 11 hardyla01     Larry    Hardy
## 12 hardyre01       Red    Hardy
```

---
# Solving Problems

New idea: remove any whitespace after a `.` in the first name before creating the variable `First Lastname`

```r
library(stringr)  # str_replace()
People <- People %>% mutate(
  `First Lastname` = paste(
    str_replace(nameFirst,"\\. ", "."), # regular expression: a literal period followed by a space
    nameLast)
)
People %>% filter(nameLast %in% c("Dickey", "Hardy")) %>%
  select(playerID, `First Lastname`)
```

```
##     playerID First Lastname
## 1  dickebi01    Bill Dickey
## 2  dickege02  George Dickey
## 3  dickera01    R.A. Dickey
## 4  hardyal01     Alex Hardy
## 5  hardybl01   Blaine Hardy
## 6  hardyca01  Carroll Hardy
## 7  hardyha01    Harry Hardy
## 8  hardyja01     Jack Hardy
## 9  hardyja02     Jack Hardy
## 10 hardyjj01     J.J. Hardy
## 11 hardyla01    Larry Hardy
## 12 hardyre01      Red Hardy
```

```r
bbwaa %>% anti_join(
  People %>% select(`First Lastname`, playerID),
  by="First Lastname") %>% nrow() # no problems anymore!
```

```
## [1] 0
```

---
class: inverse

## Your Turn

The code below merges the playerID from the expanded `People` data into the scraped `bbwaa` results.

```r
bbwaa <- bbwaa %>% left_join(
  People %>% select(`First Lastname`, playerID),
  by="First Lastname")
```

How could we get information on the category?

---
## Beyond tables

Sometimes data on the web is not structured as nicely. For example, suppose we want a list of all recently active baseball players from [Baseball Reference](http://www.baseball-reference.com/players/)

.center[![:scale 80%](baseball_reference.png)]

---
## SelectorGadget

- SelectorGadget is a JavaScript bookmarklet that helps determine the CSS selectors for the pieces of a website we want to extract.
- Bookmark the [SelectorGadget](https://selectorgadget.com/) link, then click on it to use it (or add the Chrome extension)
- When SelectorGadget is active, pieces of the website are highlighted in orange/green/red.
- Use SelectorGadget on http://www.baseball-reference.com/players/ .
- Read more details in `vignette("selectorgadget")` (or on the [rvest website](https://rvest.tidyverse.org/articles/selectorgadget.html))

If you prefer, you can also read the HTML code and create your own [CSS](https://www.w3schools.com/cssref/css_selectors.asp) or [XPath](https://www.w3schools.com/xml/xpath_syntax.asp) selectors.

---
## SelectorGadget Result

*Select all active baseball players with a last name starting with 'a'*

```r
url <- "http://www.baseball-reference.com/players/a/"
html <- read_html(url)
html %>% html_elements("b") %>% html_text()
```

```
##  [1] "Andrew Abbott (2023-2024)"
##  [2] "Cory Abbott (2021-2023)"
##  [3] "CJ Abrams (2022-2024)"
##  [4] "Bryan Abreu (2019-2024)"
##  [5] "José Abreu (2014-2024)"
##  [6] "Wilyer Abreu (2023-2024)"
##  [7] "Garrett Acton (2023-2023)"
##  [8] "Luisangel Acuña (2024-2024)"
##  [9] "Ronald Acuña Jr. (2018-2024)"
## [10] "Jason Adam (2018-2024)"
## [11] "Willy Adames (2018-2024)"
## [12] "Austin Adams (2017-2024)"
## [13] "Chance Adams (2018-2020)"
## [14] "Jordyn Adams (2023-2024)"
## [15] "Riley Adams (2021-2024)"
## [16] "Ty Adcock (2023-2024)"
## [17] "Jo Adell (2020-2024)"
## [18] "Joan Adon (2021-2024)"
## [19] "Ehire Adrianza (2013-2024)"
## [20] "Julian Aguiar (2024-2024)"
## [21] "Nick Ahmed (2014-2024)"
## [22] "Keegan Akin (2020-2024)"
## [23] "Ozzie Albies (2017-2024)"
## [24] "Jorge Alcalá (2019-2024)"
## [25] "Kevin Alcántara (2024-2024)"
## [26] "Sandy Alcántara (2017-2023)"
## [27] "Sergio Alcántara (2020-2022)"
## [28] "Sam Aldegheri (2024-2024)"
## [29] "Blaze Alexander (2024-2024)"
## [30] "CJ Alexander (2024-2024)"
## [31] "Jason Alexander (2022-2022)"
## [32] "Scott Alexander (2015-2024)"
## [33] "Tyler Alexander (2019-2024)"
## [34] "A.J. Alexy (2021-2022)"
## [35] "Anthony Alford (2017-2022)"
## [36] "Kolby Allard (2018-2024)"
## [37] "Cam Alldred (2022-2022)"
## [38] "Austin Allen (2019-2022)"
## [39] "Greg Allen (2017-2023)"
## [40] "Logan Allen (2019-2024)"
## [41] "Logan Allen (2023-2024)"
## [42] "Nick Allen (2022-2024)"
## [43] "Yency Almonte (2018-2024)"
## [44] "Albert Almora (2016-2022)"
## [45] "Pete Alonso (2019-2024)"
## [46] "Dan Altavilla (2016-2024)"
## [47] "Jose Altuve (2011-2024)"
## [48] "Jake Alu (2023-2023)"
## [49] "José Alvarado (2017-2024)"
## [50] "Armando Alvarez (2024-2024)"
## [51] "Eddy Alvarez (2020-2024)"
## [52] "Francisco Alvarez (2022-2024)"
## [53] "Nacho Alvarez Jr. (2024-2024)"
## [54] "Yordan Alvarez (2019-2024)"
## [55] "Adbert Alzolay (2019-2024)"
## [56] "Adael Amador (2024-2024)"
## [57] "Jacob Amaya (2023-2024)"
## [58] "Miguel Amaya (2023-2024)"
## [59] "Brian Anderson (2017-2024)"
## [60] "Chase Anderson (2014-2024)"
## [61] "Drew Anderson (2017-2021)"
## [62] "Grant Anderson (2023-2024)"
## [63] "Ian Anderson (2020-2022)"
## [64] "Justin Anderson (2018-2024)"
## [65] "Nick Anderson (2019-2024)"
## [66] "Shaun Anderson (2019-2024)"
## [67] "Tim Anderson (2016-2024)"
## [68] "Tyler Anderson (2016-2024)"
## [69] "Clayton Andrews (2023-2024)"
## [70] "Matt Andriese (2015-2024)"
## [71] "Miguel Andujar (2017-2024)"
## [72] "Tejay Antone (2020-2024)"
## [73] "Jonathan Aranda (2022-2024)"
## [74] "Jonathan Araúz (2020-2023)"
## [75] "Orlando Arcia (2016-2024)"
## [76] "Nolan Arenado (2013-2024)"
## [77] "Gabriel Arias (2022-2024)"
## [78] "Shawn Armstrong (2015-2024)"
## [79] "Randy Arozarena (2019-2024)"
## [80] "Luis Arráez (2019-2024)"
## [81] "Spencer Arrighetti (2024-2024)"
## [82] "Christian Arroyo (2017-2023)"
## [83] "Aaron Ashby (2021-2024)"
## [84] "Graham Ashcraft (2022-2024)"
## [85] "Javier Assad (2022-2024)"
## [86] "Nick Avila (2024-2024)"
## [87] "Pedro Avila (2019-2024)"
## [88] "José Azocar (2022-2024)"
```

```r
# html %>% html_elements("#div_players_ p") %>% html_text()
```

---
## Example, varied

We are, in fact, not just interested in the *names of the players*, but also in the *links* to each player's website

- `html_attr` lets us access an attribute of an html node
- `html_attrs` extracts all attributes of an html node

```r
html %>% html_elements("b a") %>% html_attr(name="href")
```

```
##  [1] "/players/a/abbotan01.shtml"
##  [2] "/players/a/abbotco01.shtml"
##  [3] "/players/a/abramcj01.shtml"
##  [4] "/players/a/abreubr01.shtml"
##  [5] "/players/a/abreujo02.shtml"
##  [6] "/players/a/abreuwi02.shtml"
##  [7] "/players/a/actonga01.shtml"
##  [8] "/players/a/acunajo01.shtml"
##  [9] "/players/a/acunaro01.shtml"
## [10] "/players/a/adamja01.shtml"
## [11] "/players/a/adamewi01.shtml"
## [12] "/players/a/adamsau02.shtml"
## [13] "/players/a/adamsch01.shtml"
## [14] "/players/a/adamsjo03.shtml"
## [15] "/players/a/adamsri03.shtml"
## [16] "/players/a/adcocty01.shtml"
## [17] "/players/a/adelljo01.shtml"
## [18] "/players/a/adonjo01.shtml"
## [19] "/players/a/adriaeh01.shtml"
## [20] "/players/a/aguiaju01.shtml"
## [21] "/players/a/ahmedni01.shtml"
## [22] "/players/a/akinke01.shtml"
## [23] "/players/a/albieoz01.shtml"
## [24] "/players/a/alcaljo01.shtml"
## [25] "/players/a/alcanke01.shtml"
## [26] "/players/a/alcansa01.shtml"
## [27] "/players/a/alcanse01.shtml"
## [28] "/players/a/aldegsa01.shtml"
## [29] "/players/a/alexabl01.shtml"
## [30] "/players/a/alexacj01.shtml"
## [31] "/players/a/alexaja01.shtml"
## [32] "/players/a/alexasc02.shtml"
## [33] "/players/a/alexaty01.shtml"
## [34] "/players/a/alexyaj01.shtml"
## [35] "/players/a/alforan01.shtml"
## [36] "/players/a/allarko01.shtml"
## [37] "/players/a/alldrca01.shtml"
## [38] "/players/a/allenau01.shtml"
## [39] "/players/a/allengr01.shtml"
## [40] "/players/a/allenlo01.shtml"
## [41] "/players/a/allenlo02.shtml"
## [42] "/players/a/allenni02.shtml"
## [43] "/players/a/almonye01.shtml"
## [44] "/players/a/almoral01.shtml"
## [45] "/players/a/alonspe01.shtml"
## [46] "/players/a/altavda01.shtml"
## [47] "/players/a/altuvjo01.shtml"
## [48] "/players/a/aluja01.shtml"
## [49] "/players/a/alvarjo03.shtml"
## [50] "/players/a/alvarar01.shtml"
## [51] "/players/a/alvared01.shtml"
## [52] "/players/a/alvarfr01.shtml"
## [53] "/players/a/alvarna01.shtml"
## [54] "/players/a/alvaryo01.shtml"
## [55] "/players/a/alzolad01.shtml"
## [56] "/players/a/amadoad01.shtml"
## [57] "/players/a/amayaja01.shtml"
## [58] "/players/a/amayami01.shtml"
## [59] "/players/a/anderbr06.shtml"
## [60] "/players/a/anderch01.shtml"
## [61] "/players/a/anderdr02.shtml"
## [62] "/players/a/andergr01.shtml"
## [63] "/players/a/anderia01.shtml"
## [64] "/players/a/anderju01.shtml"
## [65] "/players/a/anderni01.shtml"
## [66] "/players/a/andersh01.shtml"
## [67] "/players/a/anderti01.shtml"
## [68] "/players/a/anderty01.shtml"
## [69] "/players/a/andrecl02.shtml"
## [70] "/players/a/andrima01.shtml"
## [71] "/players/a/andujmi01.shtml"
## [72] "/players/a/antonte01.shtml"
## [73] "/players/a/arandjo01.shtml"
## [74] "/players/a/arauzjo01.shtml"
## [75] "/players/a/arciaor01.shtml"
## [76] "/players/a/arenano01.shtml"
## [77] "/players/a/ariasga01.shtml"
## [78] "/players/a/armstsh01.shtml"
## [79] "/players/a/arozara01.shtml"
## [80] "/players/a/arraelu01.shtml"
## [81] "/players/a/arrigsp01.shtml"
## [82] "/players/a/arroych01.shtml"
## [83] "/players/a/ashbyaa01.shtml"
## [84] "/players/a/ashcrgr01.shtml"
## [85] "/players/a/assadja01.shtml"
## [86] "/players/a/avilani01.shtml"
## [87] "/players/a/avilape01.shtml"
## [88] "/players/a/azocajo01.shtml"
```

---
class: inverse

## Your Turn

Use the SelectorGadget on the website for [Fernando Abad](https://www.baseball-reference.com/players/a/abadfe01.shtml)

Find the CSS selector to extract his career statistics and load them into your R session.

Does the same code work to extract career statistics for (some of) the other active players?

What other information do we need to know, and how can we get it?

---
## Your Turn - Solution

```r
url <- "https://www.baseball-reference.com/players/a/abadfe01.shtml"
html <- read_html(url)
table_col_names <- html %>% html_elements("span strong") %>% html_text()
stats <- html %>% html_elements(".stats_pullout p") %>% html_text()
stats <- matrix(stats, nrow = 1)
stats <- as.data.frame(stats)
colnames(stats) <- table_col_names
stats
```

```
##   SUMMARY WAR W  L  ERA   G GS SV    IP  SO  WHIP
## 1  Career 3.2 9 29 3.78 406  6  2 354.2 292 1.322
```

---
## Your Turn - Solution (cont'd)

It's sometimes easier (for data munging after extracting) to extract multiple pieces rather than everything in one go.
```r
(stats <- html %>% html_elements("span strong") %>% html_text())
```

```
##  [1] "SUMMARY" "WAR"     "W"       "L"       "ERA"
##  [6] "G"       "GS"      "SV"      "IP"      "SO"
## [11] "WHIP"
```

```r
(season <- html %>% html_elements(".stats_pullout p:nth-child(2)") %>% html_text())
```

```
##  [1] "Career" "3.2"    "9"      "29"     "3.78"   "406"
##  [7] "6"      "2"      "354.2"  "292"    "1.322"
```

```r
# (career <- html %>% html_elements(".stats_pullout p:nth-child(3)") %>% html_text())
```

---
## Package `rvest`

The `session` family of functions lets you simulate an HTML session without a browser.

- Create a session with `session(url)`
- Navigate to another URL with `session_jump_to()`
- Follow a link with `session_follow_link()`
- Navigate back and forward with `session_back()` and `session_forward()`
- ... and extract the pieces you are interested in using `read_html`, `html_element`, `html_elements`
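
---
## Sketch: names and links in one data frame

A minimal, offline sketch of the extraction step, assuming a page laid out roughly like the player listing used earlier. The inline HTML is a made-up stand-in (not the real page markup), and `players` is just an illustrative name. Selecting the `b a` nodes once and reading both the text and the `href` from them keeps names and links aligned; `xml2::url_absolute()` then turns the relative links into full URLs.

```r
library(rvest)

# Hypothetical stand-in for a player listing page; only bold entries are "active".
page <- minimal_html('
  <div id="players">
    <p><b><a href="/players/a/abbotan01.shtml">Andrew Abbott</a> (2023-2024)</b></p>
    <p><a href="/players/a/aaronha01.shtml">Hank Aaron</a> (1954-1976)</p>
    <p><b><a href="/players/a/abramcj01.shtml">CJ Abrams</a> (2022-2024)</b></p>
  </div>
')

# Select the link nodes once, then pull text and attribute from the same nodes.
active <- page %>% html_elements("b a")

players <- data.frame(
  name = active %>% html_text(),
  link = xml2::url_absolute(
    active %>% html_attr("href"),
    "https://www.baseball-reference.com/"  # assumed base URL for the relative links
  )
)
players
```

With a live site, the same two extraction calls should work on the result of `read_html(url)`; the full URLs in `link` are then what you would pass on to `read_html()` or `session_jump_to()` to visit each player's page.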