Reshaping data with tidyr - working with separate and unite

class: center, middle, inverse, title-slide

.title[
# Reshaping data with tidyr - working with separate and unite
]
.author[
### Will Ju
]

---

# Outline

- cleaning data (first run)

- functions `separate` and `unite`

---
# parse_number

the tidyverse package `readr` contains function `parse_number`

```r
library(readr)
x <- c("3.14", "5.2%", "$10", "5,321.00")
x
```

```
## [1] "3.14"     "5.2%"     "$10"      "5,321.00"
```

```r
parse_number(x)
```

```
## [1]    3.14    5.20   10.00 5321.00
```
---
class: inverse
# Your Turn

Work on the previously created data frame `deaths` (Avengers). The description of the data is here [https://github.com/fivethirtyeight/data/tree/master/avengers](https://github.com/fivethirtyeight/data/tree/master/avengers)

- Use `parse_number` to extract from the variable `Time` a number. Inspect the result.

- Group by Avenger (use URL) and `Died`. Find the maximum of `Time`. Call the result `maxdeaths`. Interpret the resulting data set.

```
## `summarise()` has grouped output by 'URL'. You can override using the `.groups`
## argument.
```

```
## # A tibble: 10 × 3
## # Groups:   URL [10]
##    URL                                                        Died   Time
##    <chr>                                                      <chr> <dbl>
##  1 http://marvel.wikia.com/2ZP45-9-X-51_(Earth-616)#          YES       1
##  2 http://marvel.wikia.com/Abyss_(Ex_Nihilo%27s)_(Earth-616)# YES       1
##  3 http://marvel.wikia.com/Adam_Brashear_(Earth-616)#         NO        1
##  4 http://marvel.wikia.com/Alani_Ryan_(Earth-616)#            NO        1
##  5 http://marvel.wikia.com/Alexander_Summers_(Earth-616)#     NO        1
##  6 http://marvel.wikia.com/Alexis_(Earth-616)#                NO        1
##  7 http://marvel.wikia.com/Amadeus_Cho_(Earth-616)#           NO        1
##  8 http://marvel.wikia.com/America_Chavez_(Earth-616)#        NO        1
##  9 http://marvel.wikia.com/Angelica_Jones_(Earth-616)#        NO        1
## 10 http://marvel.wikia.com/Anthony_Druid_(Earth-616)#         YES       2
```
---
class: inverse
# Your Turn

Work with the `maxdeaths` dataset. Find the frequency break down by Time and Died (`group_by`, `tally`).

```
## # A tibble: 6 × 3
## # Groups:   Died [2]
##   Died   Time     n
##   <chr> <dbl> <int>
## 1 NO        1   104
## 2 NO        2     1
## 3 YES       1    53
## 4 YES       2    14
## 5 YES       3     1
## 6 YES       5     1
```
Use the same steps for all of the Avengers' returns and you have the basic info for all statements for lab #3.

---
# separate

Messy (2): Multiple variables are stored in one column

```r
library(tidyverse)
df <- data.frame(x = c(NA, "a.b", "a.d", "b.c"))
df
```

```
##      x
## 1 <NA>
## 2  a.b
## 3  a.d
## 4  b.c
```

```r
df %>% separate(x, into = c("A", "B"))
```

```
##      A    B
## 1 <NA> <NA>
## 2    a    b
## 3    a    d
## 4    b    c
```

---
class: inverse
# Your Turn (5 min)

The Iowa Data Portal is a wealth of information on and about the State of Iowa.

The website 
[Liquor Sales](https://data.iowa.gov/Sales-Distribution/Iowa-Liquor-Sales/m3tr-qhgy) provides data on every liquor sale in a licensed store in Iowa. The code below reads (part of) the data into an R session.

```
url <- "https://github.com/ds202-at-ISU/materials/blob/master/03_tidyverse/data/ames-liquor.rds?raw=TRUE"
download.file(url, "ames-liquor.rds", mode="wb")
ames <- readRDS("ames-liquor.rds")
```

---

# Problems with the data

- `Date` is text, in the format of Month/Day/Year (Messy 2)

- Store location is a textual expression of form `POINT (`...`)` and geographic latitude and longitude. (Messy 2)

no Messy 1? - problems of type Messy 1 are typically hard to detect and often up to interpretation/dependent on the analysis to be done.

---
class: inverse
# Your Turn (5 min)

- Check the help for the function `parse_number` in the `readr` package and use it on the store location. What result do you get?

- Use `separate` to split the variable for store location into longitude and latitude.
 (Hint - you might need several steps of `separate`)

- Make sure that latitude and longitude are numeric variables.

---
class: inverse
# Your Turn - exploration (5 mins)

Use `dplyr` functionality to answer the following questions:

1. What is the total amount spent on Liquor Sales?

2. What is the single largest sale (in volume/in dollar amount)?

4. Plot geographic longitude and latitude. Where are liquor sales in Ames happening?