More graphics with ggplot2

class: center, middle, inverse, title-slide

.title[
# More graphics with ggplot2
]
.author[
### Will Ju
]

---

class: inverse, center, middle
# Looking ... some more ... at data

---

## Plan for answers

1. Explore how one (or more) variables are distributed: *barchart or histogram*

2. Explore how two variables are related:  *scatterplot, boxplot, tile plot*

3. Explore how two variables are related, conditioned on other variables: *facetting, color & other aesthetics*

Look at 3 next, then come back to 1 and 2.

---

## Getting ready

Load libraries:

```r
library(ggplot2) 
# not found? run install.packages("ggplot2")

library(classdata) 
# not found? run remotes::install_github("heike/classdata")
```

---

## Facetting

Use facet to display plots for different subsets: `facet_wrap`, `facet_grid`

```r
ggplot(aes(x = year, y = homicide), data=fbiwide) +
  facet_wrap(~state, ncol = 11) +
  geom_point()
```

![](03_r-graphics_files/figure-html/unnamed-chunk-2-1.png)

---

## tiny example

```r
library(dplyr)
df1 <- data.frame(
  x = seq(-1, 1, 0.1),
  type = 'up'
)
df1$y <- df1$x

df2 <- data.frame(
  x = seq(-1, 1, 0.1),
  type = 'down'
)
df2$y <- -df2$x

df <- rbind(df1, df2)

df <- df %>% mutate(
  sign = if_else(y >= 0, 'positive', 'negative')
)
```

---

## tiny example

```r
ggplot(df, aes(x = x, y = y,
               color = type)) +
  geom_point() + 
  facet_wrap(~type)
```

![](03_r-graphics_files/figure-html/unnamed-chunk-4-1.png)

---

## Setup of `facet_wrap` and `facet_grid`

- `facet_grid` has formula specification: `rows ~ cols`

- `facet_wrap` has specification `~ variables`

- multiple variables (in either specification) are included in form of a sum, i.e. `rowvar1 + rowvar2 ~ colvar1+ colvar2`

- no variable (in `facet_grid`) is written as `.`, i.e. `rowvar ~ .` are plots in a single column.

---
class: inverse
## Your turn

Use the `fbiwide` data from the package `classdata` for this your turn.

- Plot the number of car thefts by year for each state (facet by state).

- The numbers are dominated by the number of thefts in California, New York, and Texas. Use a log-scale for the y-axis. Does that help?

- Another approach to fix the domination by CA, TX and NY: Read up on the parameters in `facet_wrap` to find a way to give each panel its own scale. Comment on the difference in the results.

---

## Facets vs aesthetics?

- Will need to experiment as to which one answers your question/tells the story best

- Rule of thumb: comparisons of interest should be close together

---

## Boxplots

![](03_r-graphics_files/figure-html/unnamed-chunk-5-1.png)

---

## Boxplot definition

- definition by J.W. Tukey (1960s, EDA 1977)

![](images/boxplots-definition.png)

---

## Boxplots

- are used for group comparisons and outlier identifications
- usually only make sense in form of side-by-side boxplots.
- `geom_boxplot` in ggplot2 needs `x` and `y` variable (`y` is measurement, `x` is categorical)

```r
ggplot(data = fbi, aes(x = type, y = log10(count))) +
  geom_boxplot() + 
  coord_flip()
```

![](03_r-graphics_files/figure-html/unnamed-chunk-6-1.png)

---
class: inverse
## Your turn

- Using ggplot2, draw side-by-side boxplots of the number of robberies by state. Use a log transformation on y and compare results.

- **Stretch goal:** Compare rates of robberies by state, i.e. adjust robberies by the state population. Then plot side-by-side boxplots.

---

## Boxplots - Pros and Cons

- **Pros:**
    - Symmetry vs Skewness
    - Outliers
    - Quick Summary
    - Comparisons across multiple Treatments (side by side boxplots)

- **Cons:**
    - Boxplots hide multiple modes and gaps in the data

---

## Univariate plots

Histograms:

```r
ggplot(fbiwide, aes(x = motor_vehicle_theft)) + 
  geom_histogram(binwidth=5000) +
  ggtitle("binwidth = 5000")
```

![](03_r-graphics_files/figure-html/unnamed-chunk-7-1.png)

---

## Univariate plots

Histograms:

```r
ggplot(fbiwide, aes(x = motor_vehicle_theft)) + 
  geom_histogram(binwidth=1000) +
  ggtitle("binwidth = 1000")
```

![](03_r-graphics_files/figure-html/unnamed-chunk-8-1.png)

Always change the bin width! There is no perfect bin width, but a lot of different views of the data at different resolutions!
---

## Barchart

```r
ggplot(fbi, aes(x = type)) + 
  geom_bar(aes(weight= count)) +
  coord_flip()
```

![](03_r-graphics_files/figure-html/unnamed-chunk-9-1.png)

---

## Histograms and barcharts

What do we look for?

- Symmetry/Skewness

- Modes, Groups  (big pattern: where is the bulk of the data?)

- Gaps & Outliers (deviation from the big pattern: where are the other points?)

For the histogram, always choose the binwidth consciously

In a barchart, choose the order of the categories consciously (later)

---
class: inverse
## Your turn

- Use the `fbi` data set to draw a barchart of the variable `violent_crime`. Make the height of the bars dependent on the number of reports (use `weight`). Then facet by type (does the result match your expectation? good! get rid of facetting). Color bars by `type`.

- Use the `fbi` data set to draw a histogram of the number of reports. Facet by type, make sure to use individual scales for the panels.

---

## More on `ggplot2`

- reference/document: http://ggplot2.tidyverse.org/reference/

- RStudio cheat sheet for [ggplot2](https://rstudio.github.io/cheatsheets/html/data-visualization.html?_gl=1*1otqt7q*_ga*MTI3NTQxNjE5Mi4xNjkxNjM0NjQw*_ga_2C0WZ1JHG0*MTY5NDQ3MzE5My45LjAuMTY5NDQ3MzE5My4wLjAuMA..)

- ggplot2 mailing list: 
https://groups.google.com/forum/?fromgroups#!forum/ggplot2