Reshaping data

class: center, middle, inverse, title-slide

.title[
# Reshaping data
]
.author[
### Will Ju
]

---

class: middle, inverse, center
# The `tidyr` package

---

# What is tidy data?

> Happy families are all alike; every unhappy family is unhappy in its own way.<br>
Leo Tolstoy

- `tidyr` vignette available as `vignette("tidy-data", package="tidyr")`

- vignette is updated version of the [tidy data paper](https://www.jstatsoft.org/article/view/v059i10) with updated code

---

# Outline

- Different sources of messiness

- Key-Value pairs

- `tidyr`: `pivot_longer` and `pivot_wider` (former `gather` and `spread`)

---

# Sources of Messiness

1. Column headers are values, not variable names.<br>
e.g. *treatmenta, treatmentb*

2. Multiple variables are stored in one column.<br>
e.g. *Fall 2015, Spring 2016* or *"1301 8th St SE, Orange City, Iowa 51041
(42.99755, -96.04149)", "2102 Durant, Harlan, Iowa 51537
(41.65672, -95.33780)"*

3. Multiple observational units are stored in the same table.

4. A single observational unit is stored in multiple tables.

---

# Data often is in a spreadsheet format, but ...

there's different ways of encoding the same information:

Option #1

```
##           name treatmenta treatmentb
## 1   John Smith         NA         18
## 2     Jane Doe          4          1
## 3 Mary Johnson          6          7
```

Option #2

```
##   treatment John.Smith Jane.Doe Mary.Johnson
## 1         a         NA        4            6
## 2         b         18        1            7
```

Neither #1 nor #2 are "clean" versions of the data: not key-value pairs

---

# Tidy data

1. Each variable forms one column.

2. Each observation forms one row.

3. Each type of observational unit forms a table.

---

# Clean version of the example

```
##   treatment      patient score
## 1         a   John Smith    NA
## 2         b   John Smith    18
## 3         a     Jane Doe     4
## 4         b     Jane Doe     1
## 5         a Mary Johnson     6
## 6         b Mary Johnson     7
```

- `treatment` and `patient` uniquely describe a single row in the dataset.

- `treatment` and `patient` are **key variables**,

- `score` is a **measurement variable**

- this makes `treatment-patient` and `score` a **key-value pair**

---

# Key-value pairs (KVP)
 
**Key-Value pairs** (KVP) - also *attribute-value*, *field-value*, *name-value*:
abstract data representation that allows a lot of flexibility

One way of telling whether a data set is tidy is to check that all keys for a value are aligned in one row:

---

# Tidying data - Plan of attack

Very few functions are needed for tidying data:

- Messy (1): `tidyr` functions `pivot_longer` and `pivot_wider`.

- `pivot_longer(data,  cols, names_to = "name", values_to = "values")`: take multiple columns and collapse into key-value pairs 
    - `pivot_wider (data, names_from=name, values_from=value)`: spread a key-value pair across multiple columns.

- Messy (2): `tidyr` function `separate (data, col, into, sep = "[^[:alnum:]]+")`:
separate one column into multiple columns

- Messy (3): `dplyr` - some combination of the functions discussed previously

- Messy (4): `dplyr` functionality `join` (and friends) to combine multiple data sets

---

# French fries data

```r
data(french_fries, package="reshape2")
```

- data from sensory experiment conducted at Iowa State University in 2004

- investigators were interested in comparing effects of three different fryer oils on taste of fries

```r
french_fries %>% head()
```

```
##    time treatment subject rep potato buttery grassy rancid painty
## 61    1         1       3   1    2.9     0.0    0.0    0.0    5.5
## 25    1         1       3   2   14.0     0.0    0.0    1.1    0.0
## 62    1         1      10   1   11.0     6.4    0.0    0.0    0.0
## 26    1         1      10   2    9.9     5.9    2.9    2.2    0.0
## 63    1         1      15   1    1.2     0.1    0.0    1.1    5.1
## 27    1         1      15   2    8.8     3.0    3.6    1.5    2.3
```

---

# Pivot to long

```r
ffm <- french_fries %>% 
  pivot_longer(cols = potato:painty, 
               names_to = "scale", 
               values_to = "score")
ffm
```

```
## # A tibble: 3,480 × 6
##    time  treatment subject   rep scale   score
##    <fct> <fct>     <fct>   <dbl> <chr>   <dbl>
##  1 1     1         3           1 potato    2.9
##  2 1     1         3           1 buttery   0  
##  3 1     1         3           1 grassy    0  
##  4 1     1         3           1 rancid    0  
##  5 1     1         3           1 painty    5.5
##  6 1     1         3           2 potato   14  
##  7 1     1         3           2 buttery   0  
##  8 1     1         3           2 grassy    0  
##  9 1     1         3           2 rancid    1.1
## 10 1     1         3           2 painty    0  
## # ℹ 3,470 more rows
```

---

# Pivot to long

---
class: inverse
# Your Turn

Load the Avengers' data from FiveThirtyEight's survey into your work session:

Data Description: [https://github.com/fivethirtyeight/data/tree/master/avengers](https://github.com/fivethirtyeight/data/tree/master/avengers)

```r
av <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/avengers/avengers.csv", stringsAsFactors = FALSE)
```

- Use the `select` function and the selector `starts_with` (read up on `?starts_with`) to select the name of the Avenger and all five death variables. Inspect the result with `head`.

- Create a long form of the five death variables in the `av` data set using the function `pivot_longer`. Call the names `Time` and the values `Died`. Select `URL`, `Name.Alias`, `Time`, and `Died`, and save the result as `deaths`. Inspect the data set.

- Create a (temporary) frequency breakdown of the variable `Died`. How can we interpret these numbers?

---
# Pivot to wide

```r
ffm %>% pivot_wider(names_from = rep, values_from = score)
```

```
## # A tibble: 1,740 × 6
##    time  treatment subject scale     `1`   `2`
##    <fct> <fct>     <fct>   <chr>   <dbl> <dbl>
##  1 1     1         3       potato    2.9  14  
##  2 1     1         3       buttery   0     0  
##  3 1     1         3       grassy    0     0  
##  4 1     1         3       rancid    0     1.1
##  5 1     1         3       painty    5.5   0  
##  6 1     1         10      potato   11     9.9
##  7 1     1         10      buttery   6.4   5.9
##  8 1     1         10      grassy    0     2.9
##  9 1     1         10      rancid    0     2.2
## 10 1     1         10      painty    0     0  
## # ℹ 1,730 more rows
```

```r
#ffm %>% spread(key = rep, value = score)
```

---

```r
ffm %>% 
  pivot_wider(
    names_from = rep, values_from = score) %>%
  ggplot(aes(x = `1`, y = `2`)) + geom_point() +
  facet_wrap(~scale) + geom_abline(colour = "grey50")
```

![](05_tidyr_files/figure-html/unnamed-chunk-9-1.png)

---
class: inverse
# Your turn

For this your turn use the `french_fries` data from the `reshape2` package: `data("french_fries", package="reshape2")`

- Use `pivot_longer` from the `tidyr` package to combine the different scales for assessing french fries into a single variable. Call the key-value pair "scale" and "score".

- Use `pivot_wider` from the `tidyr` package to get a format in which you can directly compare values from week 1 to week 10. Plot a scatterplot of values in week 1 against week 10. Facet by treatment and scale, color by individuals   and use different shapes for the replicates. Is there a pattern visible?

---
class: inverse
# Your turn

For this your turn use the `fbiwide` data from the `classdata` package: `data("fbiwide", package="classdata")`

- Use  `pivot_longer` from the `tidyr` package to combine the variables for the different types of crimes into one variable. Call the key-value pair "Type" and "Incidences". Compute a crime rate

- Only consider crime rates for Iowa and Minnesota. Use  `pivot_wider` to create incidence columns for each of these states. Plot crimes in Iowa against crimes in Minnesota, colour by type of crime. Note: you need to exclude some variables.

---

# `pivot_wider` works as inverse of `pivot_longer` ...

![](images/spread-gather-clean.png)

---

# Front part is not identical - not collapsible

![](images/spread-gather-unclean.png)

---

# Where using `pivot_wider` is tricky

- `pivot_wider` is the (almost) inverse operation of `pivot_longer`

- when we use `pivot_longer` the key information is duplicated

- when using `pivot_wider` we need to make sure that the information outside the key and value is identical for all levels (and combinations of levels) of the `key` variable.

- If the info is not identical across levels, missing values are introduced into the new data set.

---

# `pivot_wider` in the example

```r
library(classdata)
fbi %>% filter(year==2014, state_abbr %in% c("IA", "MN"),
               type=="burglary")
```

```
## # A tibble: 2 × 8
##   state     state_id state_abbr  year population type     count violent_crime
##   <chr>        <int> <chr>      <int>      <int> <chr>    <int> <lgl>        
## 1 Iowa            16 IA          2014    3109481 burglary 14363 FALSE        
## 2 Minnesota       27 MN          2014    5457125 burglary 20773 FALSE
```

Other state specific values don't allow spread to collapse the front part and introduce missing values:

```r
fbi %>% filter(year==2014, state %in% c("Iowa", "Minnesota"),  type=="burglary") %>% 
  pivot_wider(names_from=state, values_from=count)
```

```
## # A tibble: 2 × 8
##   state_id state_abbr  year population type     violent_crime  Iowa Minnesota
##      <int> <chr>      <int>      <int> <chr>    <lgl>         <int>     <int>
## 1       16 IA          2014    3109481 burglary FALSE         14363        NA
## 2       27 MN          2014    5457125 burglary FALSE            NA     20773
```

---

# fixing `pivot_wider` in the example

Only select all relevant variables (or negative select all state-specific variables) to get the example to "work":

```r
fbi %>% filter(year==2020, 
               state_abbr %in% c("IA", "MN"),
               type=="burglary") %>% 
  select(year, type, state, count) %>% 
  pivot_wider(names_from=state, values_from=count)
```

```
## # A tibble: 1 × 4
##    year type      Iowa Minnesota
##   <int> <chr>    <int>     <int>
## 1  2020 burglary 11739     17370
```