class: center, middle, inverse, title-slide

.title[
# More graphics with ggplot2
]
.author[
### Will Ju
]

---
class: inverse, center, middle

# Looking ... some more ... at data

---

## Plan for answers

1. Explore how one (or more) variables are distributed: *barchart or histogram*
2. Explore how two variables are related: *scatterplot, boxplot, tile plot*
3. Explore how two variables are related, conditioned on other variables: *faceting, color & other aesthetics*

Look at 3 next, then come back to 1 and 2.

---

## Getting ready

Load libraries:

```r
library(ggplot2)   # not found? run install.packages("ggplot2")
library(classdata) # not found? run remotes::install_github("heike/classdata")
```

---

## Faceting

Use facets to display plots for different subsets of the data: `facet_wrap`, `facet_grid`

```r
ggplot(aes(x = year, y = homicide), data = fbiwide) +
  facet_wrap(~state, ncol = 11) +
  geom_point()
```

![](03_r-graphics_files/figure-html/unnamed-chunk-2-1.png)<!-- -->

---

## Setup of `facet_wrap` and `facet_grid`

- `facet_grid` has the formula specification `rows ~ cols`
- `facet_wrap` has the specification `~ variables`
- multiple variables (in either specification) are combined with `+`, e.g. `rowvar1 + rowvar2 ~ colvar1 + colvar2`
- a missing variable (in `facet_grid`) is written as `.`, e.g. `rowvar ~ .` arranges the panels in a single column

---
class: inverse

## Your turn

Use the `fbiwide` data from the `classdata` package for this exercise.

- Plot the number of car thefts by year for each state (facet by state).
- The numbers are dominated by the number of thefts in California, New York, and Texas. Use a log-scale for the y-axis. Does that help?
- Another approach to fix the domination by CA, TX and NY: Read up on the parameters in `facet_wrap` to find a way to give each panel its own scale. Comment on the difference in the results.

---

## Facets vs aesthetics?

- You will need to experiment to see which one answers your question or tells the story best
- Rule of thumb: comparisons of interest should be close together

---

## Boxplots

![](03_r-graphics_files/figure-html/unnamed-chunk-3-1.png)<!-- -->

---

## Boxplot definition

- definition by J.W. Tukey (1960s, EDA 1977)

![](images/boxplots-definition.png)

---

## Boxplots

- are used for group comparisons and outlier identification
- usually only make sense in the form of side-by-side boxplots
- `geom_boxplot` in ggplot2 needs an `x` and a `y` variable (`y` is the measurement, `x` is categorical)

```r
ggplot(data = fbi, aes(x = type, y = log10(count))) +
  geom_boxplot() +
  coord_flip()
```

![](03_r-graphics_files/figure-html/unnamed-chunk-4-1.png)<!-- -->

---
class: inverse

## Your turn

- Using ggplot2, draw side-by-side boxplots of the number of robberies by state. Use a log transformation on y and compare results.
- **Stretch goal:** Compare rates of robberies by state, i.e. adjust robberies by the state population. Then plot side-by-side boxplots.

---

## Boxplots - Pros and Cons

- **Pros:**
  - Symmetry vs Skewness
  - Outliers
  - Quick Summary
  - Comparisons across multiple Treatments (side by side boxplots)

- **Cons:**
  - Boxplots hide multiple modes and gaps in the data

---

## Univariate plots

Histograms:

```r
ggplot(fbiwide, aes(x = motor_vehicle_theft)) +
  geom_histogram(binwidth = 5000) +
  ggtitle("binwidth = 5000")
```

![](03_r-graphics_files/figure-html/unnamed-chunk-5-1.png)<!-- -->

---

## Univariate plots

Histograms:

```r
ggplot(fbiwide, aes(x = motor_vehicle_theft)) +
  geom_histogram(binwidth = 1000) +
  ggtitle("binwidth = 1000")
```

![](03_r-graphics_files/figure-html/unnamed-chunk-6-1.png)<!-- -->

Always change the bin width! There is no perfect bin width; different bin widths give different views of the data at different resolutions.

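A minimal sketch of looking at the same variable at several bin widths. The data are simulated (a hypothetical stand-in for `fbiwide$motor_vehicle_theft`, so the snippet runs without `classdata`), and the names `thefts`, `binwidths`, and `plots` are made up for illustration:

```r
library(ggplot2)

# Assumption: simulated right-skewed counts standing in for the real data.
set.seed(1)
thefts <- data.frame(
  motor_vehicle_theft = round(rlnorm(500, meanlog = 9, sdlog = 1.2))
)

# Bin widths to compare, from fine to coarse.
binwidths <- c(1000, 5000, 20000)

# Build one histogram per bin width; only the binwidth and title change.
plots <- lapply(binwidths, function(bw) {
  ggplot(thefts, aes(x = motor_vehicle_theft)) +
    geom_histogram(binwidth = bw) +
    ggtitle(paste("binwidth =", bw))
})

# Draw the histograms one after the other.
for (p in plots) print(p)
```

With the class data loaded, replacing `thefts` by `fbiwide` should give the corresponding views of the real counts.
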
---

## Barchart

```r
ggplot(fbi, aes(x = type)) +
  geom_bar(aes(weight = count)) +
  coord_flip()
```

![](03_r-graphics_files/figure-html/unnamed-chunk-7-1.png)<!-- -->

---

## Histograms and barcharts

What do we look for?

- Symmetry/Skewness
- Modes, Groups (big pattern: where is the bulk of the data?)
- Gaps & Outliers (deviation from the big pattern: where are the other points?)

For the histogram, always choose the binwidth consciously.

In a barchart, choose the order of the categories consciously (later).

---
class: inverse

## Your turn

- Use the `fbi` data set to draw a barchart of the variable `violent_crime`. Make the height of the bars dependent on the number of reports (use `weight`). Then facet by type (does the result match your expectation? good! get rid of faceting). Color bars by `type`.
- Use the `fbi` data set to draw a histogram of the number of reports. Facet by type, make sure to use individual scales for the panels.

---

## More on `ggplot2`

- reference documentation: http://ggplot2.tidyverse.org/reference/
- RStudio cheat sheet for [ggplot2](https://rstudio.github.io/cheatsheets/html/data-visualization.html?_gl=1*1otqt7q*_ga*MTI3NTQxNjE5Mi4xNjkxNjM0NjQw*_ga_2C0WZ1JHG0*MTY5NDQ3MzE5My45LjAuMTY5NDQ3MzE5My4wLjAuMA..)
- ggplot2 mailing list: https://groups.google.com/forum/?fromgroups#!forum/ggplot2