class: center, middle, inverse, title-slide .title[ # Pulling data from different sources ] .author[ ### DS 2020 ] --- class: middle, inverse, center # Joining data from different tables --- # Messy (4) Messy (4): *Values for a single observational unit are stored across multiple tables.* After data are normalized by splitting, we want to de-normalize again by *joining* datasets. --- # Example: Lahman package Sean Lahman is a database journalist, who started databases of historical sports statistics, in particular, the Lahman database on baseball. ```r library(Lahman) LahmanData ``` ``` ## file class nobs nvar title ## 1 AllstarFull data.frame 5454 8 AllstarFull table ## 2 Appearances data.frame 110423 21 Appearances table ## 3 AwardsManagers data.frame 179 6 AwardsManagers table ## 4 AwardsPlayers data.frame 6531 6 AwardsPlayers table ## 5 AwardsShareManagers data.frame 425 7 AwardsShareManagers table ## 6 AwardsSharePlayers data.frame 6879 7 AwardsSharePlayers table ## 7 Batting data.frame 110495 22 Batting table ## 8 BattingPost data.frame 15879 22 BattingPost table ## 9 CollegePlaying data.frame 17350 3 CollegePlaying table ## 10 Fielding data.frame 147080 18 Fielding table ## 11 FieldingOF data.frame 12028 6 FieldingOF table ## 12 FieldingOFsplit data.frame 34563 18 FieldingOFsplit table ## 13 FieldingPost data.frame 15063 17 FieldingPost data ## 14 HallOfFame data.frame 4191 9 Hall of Fame Voting Data ## 15 HomeGames data.frame 3195 9 HomeGames table ## 16 Managers data.frame 3684 10 Managers table ## 17 ManagersHalf data.frame 93 10 ManagersHalf table ## 18 Parks data.frame 255 6 Parks table ## 19 People data.frame 20370 26 People table ## 20 Pitching data.frame 49430 30 Pitching table ## 21 PitchingPost data.frame 6308 30 PitchingPost table ## 22 Salaries data.frame 26428 5 Salaries table ## 23 Schools data.frame 1207 5 Schools table ## 24 SeriesPost data.frame 367 9 SeriesPost table ## 25 Teams data.frame 2985 48 Teams table ## 26 TeamsFranchises data.frame 120 4 TeamFranchises table ## 27 TeamsHalf data.frame 52 10 TeamsHalf table ``` --- # Lahman database The Lahman database consists of 27 data frames that are linked by `playerID`. <br> This is clean, but not very readable. <br> The `People` table includes names and other attributes for each player. <br> *Joining* multiple tables helps us to bring together (pieces of) information from multiple sources. --- # Example: Hall of Fame Who was inducted in the Hall of Fame in 2018? (Note: not all in this subset) ```r HallOfFame %>% filter(yearID==2018) ``` ``` ## playerID yearID votedBy ballots needed votes inducted category needed_note ## 1 jonesch06 2018 BBWAA 422 317 410 Y Player <NA> ## 2 guerrvl01 2018 BBWAA 422 317 392 Y Player <NA> ## 3 thomeji01 2018 BBWAA 422 317 379 Y Player <NA> ## 4 hoffmtr01 2018 BBWAA 422 317 337 Y Player <NA> ## 5 martied01 2018 BBWAA 422 317 297 N Player <NA> ## 6 mussimi01 2018 BBWAA 422 317 268 N Player <NA> ## 7 clemero02 2018 BBWAA 422 317 242 N Player <NA> ## 8 bondsba01 2018 BBWAA 422 317 238 N Player <NA> ## 9 schilcu01 2018 BBWAA 422 317 216 N Player <NA> ## 10 vizquom01 2018 BBWAA 422 317 156 N Player <NA> ## 11 walkela01 2018 BBWAA 422 317 144 N Player <NA> ## 12 mcgrifr01 2018 BBWAA 422 317 98 N Player <NA> ## 13 ramirma02 2018 BBWAA 422 317 93 N Player <NA> ## 14 kentje01 2018 BBWAA 422 317 61 N Player <NA> ## 15 sheffga01 2018 BBWAA 422 317 47 N Player <NA> ## 16 wagnebi02 2018 BBWAA 422 317 47 N Player <NA> ## 17 rolensc01 2018 BBWAA 422 317 43 N Player <NA> ## 18 sosasa01 2018 BBWAA 422 317 33 N Player <NA> ## 19 jonesan01 2018 BBWAA 422 317 31 N Player <NA> ## 20 moyerja01 2018 BBWAA 422 317 10 N Player <NA> ## 21 santajo01 2018 BBWAA 422 317 10 N Player <NA> ## 22 damonjo01 2018 BBWAA 422 317 8 N Player <NA> ## 23 matsuhi01 2018 BBWAA 422 317 4 N Player <NA> ## 24 carpech01 2018 BBWAA 422 317 2 N Player <NA> ## 25 woodke02 2018 BBWAA 422 317 2 N Player <NA> ## 26 hernali01 2018 BBWAA 422 317 1 N Player <NA> ## 27 leeca01 2018 BBWAA 422 317 1 N Player <NA> ## 28 hudsoor01 2018 BBWAA 422 317 0 N Player <NA> ## 29 huffau01 2018 BBWAA 422 317 0 N Player <NA> ## 30 isrinja01 2018 BBWAA 422 317 0 N Player <NA> ## 31 lidgebr01 2018 BBWAA 422 317 0 N Player <NA> ## 32 millwke01 2018 BBWAA 422 317 0 N Player <NA> ## 33 zambrca01 2018 BBWAA 422 317 0 N Player <NA> ## 34 morrija02 2018 Veterans NA NA NA Y Player <NA> ## 35 trammal01 2018 Veterans NA NA NA Y Player <NA> ``` --- class: inverse # Your turn For this your turn use the `HallOfFame` and `People` data from the `Lahman` package - Identify the four players who were inducted in the Hall of Fame in 2018 by the BBWAA (Baseball writers' association of America), by filtering the `People` data for their player IDs. - Daredevils only: can you get the function `left_join` to work (read up on the help), so that all of the names are included in the 2018 Hall of Fame candidates? --- # Joins - general idea ![](images/join.png) --- # Joins - more specific idea - Data sets are joined along values of variables. - In `dplyr` there are several join functions: `left_join`, `inner_join`, `full_join`, ... - Differences between join functions only visible, if not all values in one set have values in the other --- ```r df1 <- data.frame(id = 1:6, trt = rep(c("A", "B", "C"), rep=c(2,1,3)), value = c(5,3,7,1,2,3)) df1 ``` ``` ## id trt value ## 1 1 A 5 ## 2 2 B 3 ## 3 3 C 7 ## 4 4 A 1 ## 5 5 B 2 ## 6 6 C 3 ``` ```r df2 <- data.frame(id=c(4,4,5,5,7,7), stress=rep(c(0,1), 3), bpm = c(65, 125, 74, 136, 48, 110)) df2 ``` ``` ## id stress bpm ## 1 4 0 65 ## 2 4 1 125 ## 3 5 0 74 ## 4 5 1 136 ## 5 7 0 48 ## 6 7 1 110 ``` --- # Left join - all elements in the *left* data set are kept - non-matches are filled in by `NA` - `right_join` works symmetric ```r left_join(df1, df2, by="id") ``` ``` ## id trt value stress bpm ## 1 1 A 5 NA NA ## 2 2 B 3 NA NA ## 3 3 C 7 NA NA ## 4 4 A 1 0 65 ## 5 4 A 1 1 125 ## 6 5 B 2 0 74 ## 7 5 B 2 1 136 ## 8 6 C 3 NA NA ``` --- # Inner join - only matches from both data sets are kept ```r inner_join(df1, df2, by = "id") ``` ``` ## id trt value stress bpm ## 1 4 A 1 0 65 ## 2 4 A 1 1 125 ## 3 5 B 2 0 74 ## 4 5 B 2 1 136 ``` --- # Full join - all ids are kept, missings are filled in with `NA` ```r full_join(df1, df2, by = "id") ``` ``` ## id trt value stress bpm ## 1 1 A 5 NA NA ## 2 2 B 3 NA NA ## 3 3 C 7 NA NA ## 4 4 A 1 0 65 ## 5 4 A 1 1 125 ## 6 5 B 2 0 74 ## 7 5 B 2 1 136 ## 8 6 C 3 NA NA ## 9 7 <NA> NA 0 48 ## 10 7 <NA> NA 1 110 ``` --- class:inverse # Your turn (10 mins) - Load the `Lahman` package into your R session. - Join (relevant pieces of) the `People` data set and the `HallOfFame` data. - Find all Hall of Famers who were alive as of 2018. - How many attempts at being inducted to the HoF does Sammy Sosa have already? (as of 2022 he is [no longer under consideration for HoF](https://en.wikipedia.org/wiki/2022_Baseball_Hall_of_Fame_balloting)) --- # Traps of joins - sometimes we unexpectedly cannot match values: missing values, different spelling, ... - join can be along multiple variables, e.g. `by = c("ID", "Date")` - joining variable(s) can have different names, e.g. `by = c("State" = "Name")` - always make sure to check dimensions of data before and after a join - check on missing values; help with that: `anti_join` --- # Anti join - a neat function in `dplyr` - careful, not symmetric! ```r anti_join(df1, df2, by="id") # no values for id in df2 ``` ``` ## id trt value ## 1 1 A 5 ## 2 2 B 3 ## 3 3 C 7 ## 4 6 C 3 ``` ```r anti_join(df2, df1, by="id") # no values for id in df1 ``` ``` ## id stress bpm ## 1 7 0 48 ## 2 7 1 110 ``` <!-- --- --> <!-- class: inverse --> <!-- # Your turn - Homework #7 --> <!-- For this example, use the data from the `classdata` package --> <!-- - Load the `classdata` package into your R session. --> <!-- - Investigate data sets `box` and `budget`. --> <!-- - Join the two datasets to incorporate the release date of movies into the box office gross. --> <!-- - Check on the dimensions of the data sets before and after the join. Where are potential problems? --> <!-- - Use `anti_join` to detect problematic cases. --> <!-- - Use `box` to show cumulative box office revenue over time. Extra point: label very successful movies in a readable manner. --> <!-- - Identify box office hits and flops and visualize them. -->