class: center, middle, inverse, title-slide .title[ # Reshaping data with tidyr - working with separate and unite ] .author[ ### Will Ju ] --- # Outline - cleaning data (first run) - functions `separate` and `unite` --- # parse_number the tidyverse package `readr` contains function `parse_number` ```r library(readr) x <- c("3.14", "5.2%", "$10", "5,321.00") x ``` ``` ## [1] "3.14" "5.2%" "$10" "5,321.00" ``` ```r parse_number(x) ``` ``` ## [1] 3.14 5.20 10.00 5321.00 ``` --- class: inverse # Your Turn Work on the previously created data frame `deaths` (Avengers). The description of the data is here [https://github.com/fivethirtyeight/data/tree/master/avengers](https://github.com/fivethirtyeight/data/tree/master/avengers) - Use `parse_number` to extract from the variable `Time` a number. Inspect the result. - Group by Avenger (use URL) and `Died`. Find the maximum of `Time`. Call the result `maxdeaths`. Interpret the resulting data set. ``` ## `summarise()` has grouped output by 'URL'. You can override using the `.groups` ## argument. ``` ``` ## # A tibble: 10 × 3 ## # Groups: URL [10] ## URL Died Time ## <chr> <chr> <dbl> ## 1 http://marvel.wikia.com/2ZP45-9-X-51_(Earth-616)# YES 1 ## 2 http://marvel.wikia.com/Abyss_(Ex_Nihilo%27s)_(Earth-616)# YES 1 ## 3 http://marvel.wikia.com/Adam_Brashear_(Earth-616)# NO 1 ## 4 http://marvel.wikia.com/Alani_Ryan_(Earth-616)# NO 1 ## 5 http://marvel.wikia.com/Alexander_Summers_(Earth-616)# NO 1 ## 6 http://marvel.wikia.com/Alexis_(Earth-616)# NO 1 ## 7 http://marvel.wikia.com/Amadeus_Cho_(Earth-616)# NO 1 ## 8 http://marvel.wikia.com/America_Chavez_(Earth-616)# NO 1 ## 9 http://marvel.wikia.com/Angelica_Jones_(Earth-616)# NO 1 ## 10 http://marvel.wikia.com/Anthony_Druid_(Earth-616)# YES 2 ``` --- class: inverse # Your Turn Work with the `maxdeaths` dataset. Find the frequency break down by Time and Died (`group_by`, `tally`). ``` ## # A tibble: 6 × 3 ## # Groups: Died [2] ## Died Time n ## <chr> <dbl> <int> ## 1 NO 1 104 ## 2 NO 2 1 ## 3 YES 1 53 ## 4 YES 2 14 ## 5 YES 3 1 ## 6 YES 5 1 ``` Use the same steps for all of the Avengers' returns and you have the basic info for all statements for lab #3. --- # separate Messy (2): Multiple variables are stored in one column ```r library(tidyverse) df <- data.frame(x = c(NA, "a.b", "a.d", "b.c")) df ``` ``` ## x ## 1 <NA> ## 2 a.b ## 3 a.d ## 4 b.c ``` ```r df %>% separate(x, into = c("A", "B")) ``` ``` ## A B ## 1 <NA> <NA> ## 2 a b ## 3 a d ## 4 b c ``` --- class: inverse # Your Turn (5 min) The Iowa Data Portal is a wealth of information on and about the State of Iowa. The website [Liquor Sales](https://data.iowa.gov/Sales-Distribution/Iowa-Liquor-Sales/m3tr-qhgy) provides data on every liquor sale in a licensed store in Iowa. The code below reads (part of) the data into an R session. ``` url <- "https://github.com/ds202-at-ISU/materials/blob/master/03_tidyverse/data/ames-liquor.rds?raw=TRUE" download.file(url, "ames-liquor.rds", mode="wb") ames <- readRDS("ames-liquor.rds") ``` --- # Problems with the data - `Date` is text, in the format of Month/Day/Year (Messy 2) - Store location is a textual expression of form `POINT (`...`)` and geographic latitude and longitude. (Messy 2) no Messy 1? - problems of type Messy 1 are typically hard to detect and often up to interpretation/dependent on the analysis to be done. --- class: inverse # Your Turn (10 min) - Check the help for the function `parse_number` in the `readr` package and use it on the store location. What result do you get? - Use `separate` to split the variable for store location into longitude and latitude. (Hint - you might need several steps of `separate`) - Make sure that latitude and longitude are numeric variables. --- class: inverse # Your Turn - exploration (10 mins) Use `dplyr` functionality to answer the following questions: 1. What is the total amount spent on Liquor Sales? 2. What is the single largest sale (in volume/in dollar amount)? 4. Plot geographic longitude and latitude. Where are liquor sales in Ames happening?