class: center, middle, inverse, title-slide

.title[
# DS 202: Factor variables
]
.author[
### Will Ju
]

---

# Factors

- A special type of variable to indicate categories
    - stores both *labels* and their *order* (i.e. numbers)
- By default, text variables are stored as text during input
    - some text variables should be considered categorical
    - numeric categorical variables have to be converted to factors manually
- `factor` creates a new factor with specified labels
- `summary` of a `factor` variable gives a frequency breakdown (counts per level)

---
class: inverse

# Your Turn

- Inspect the `fbi` object. How many variables are there? What type is each variable?
- Make a summary of `year`
- Make `year` a factor variable: `fbi$year <- factor(fbi$year)`
- Compare the summary of `year` to the previous result
- Are there other variables that should be factors (or vice versa)?

```r
library(classdata)
library(tidyverse)

# treat crime type as a categorical variable
fbi$type <- factor(fbi$type)
```

---

# Note: factors in boxplots

Boxplots in ggplot2 only work properly if the x variable is a character variable or a factor:

```r
# keep only the years 1980 and 2016
twoyear <- dplyr::filter(fbi, year %in% c(1980, 2016))
```

.pull-left[

```r
# year used as is
ggplot(data = twoyear, aes(x = year, y = count)) +
  geom_boxplot()
```

![](05_factors_files/figure-html/unnamed-chunk-4-1.png)
]

.pull-right[

```r
# year converted to a factor
ggplot(data = twoyear, aes(x = factor(year), y = count)) +
  geom_boxplot()
```

![](05_factors_files/figure-html/unnamed-chunk-5-1.png)
]

---

# Data types: checking and casting

Checking for, and casting between, types:

- `str` and `mode` provide information on the type
- `is.XXX` (with XXX one of `factor, integer, numeric, logical, character, ...`) checks for a specific type
- `as.XXX` casts to a specific type

---

# Casting between types

![](images/casting.png)

**Note:** `as.numeric` applied to a factor retrieves the *order* of the labels (their integer codes), not the labels themselves, even if those could be interpreted as numbers. To get the labels of a factor as numbers, first cast to character and then to a number.
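---

# Casting a factor to numbers: a sketch

The note on casting can be illustrated with a small example. This is only a sketch: the vector `year` below is made-up toy data, not the `fbi` data.

```r
# Hypothetical toy data: years stored as a factor
year <- factor(c("2016", "1980", "2016", "1990"))

# Levels are sorted labels: "1980" "1990" "2016"
levels(year)

# as.numeric() returns the position of each value in levels(year)
codes <- as.numeric(year)
codes   # 3 1 3 2

# Casting to character first recovers the labels as numbers
values <- as.numeric(as.character(year))
values  # 2016 1980 2016 1990
```

The first result is rarely what is wanted, so the two-step cast is the safer habit whenever a factor holds numeric labels.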
---

# Levels of factor variables

- `levels(x)` shows the levels of factor variable `x` in their current order
- Factor variables often have to be re-ordered for ease of comparison
- We can specify the order of the levels by listing them explicitly, see `help(factor)`
- We can make the order of the levels in one variable depend on a summary statistic of another variable

---

# Checking factor levels

The order of the levels is preserved in charts:

```r
fbi %>%
  ggplot(aes(x = type, y = log10(count))) +
  geom_boxplot()
```

![](05_factors_files/figure-html/unnamed-chunk-6-1.png)

```r
levels(fbi$type)
```

```
## [1] "aggravated_assault"  "arson"               "burglary"
## [4] "homicide"            "larceny"             "motor_vehicle_theft"
## [7] "rape_legacy"         "rape_revised"        "robbery"
```

---

# Reordering factor levels - manual

Manual re-ordering (extremely sensitive to typos):

```r
# list every level explicitly, in the desired order
fbi$type <- factor(fbi$type,
                   levels = c("larceny", "burglary", "motor_vehicle_theft",
                              "arson", "aggravated_assault", "robbery",
                              "rape_legacy", "rape_revised", "homicide"))

fbi %>%
  ggplot(aes(x = type, y = log10(count))) +
  geom_boxplot()
```

![](05_factors_files/figure-html/unnamed-chunk-7-1.png)

---

# Reordering factor levels - using another variable

`reorder(factor, numbers, function)` reorders the levels in `factor` by the values in `numbers`. Use `function` to summarise (the average is used by default).

```r
# order the crime types by their average count, ignoring missing values
fbi$type <- reorder(fbi$type, fbi$count, na.rm = TRUE)

fbi %>%
  ggplot(aes(x = type, y = log10(count))) +
  geom_boxplot()
```

![](05_factors_files/figure-html/unnamed-chunk-8-1.png)

Missing values in `numbers`? Make sure to use the parameter `na.rm=TRUE`!

---
class: inverse

## Your turn

For this Your Turn, use the `fbiwide` object from the `classdata` package.

- Plot a boxplot of the number of motor vehicle thefts by year. (You might have to convert year to a factor variable!)
- Plot a boxplot of the rate (adjust to some interpretable rate - e.g. Ames standard) of motor vehicle thefts by state (abbreviations).
  Add `coord_flip()` in case the state names run into one another.
- Reorder the boxplots of rates of motor vehicle thefts, such that the boxplots are ordered by their medians.
- Plot (reordered) boxplots by state for another type of crime.

---

# Changing level names

```r
levels(fbi$type)
```

```
## [1] "homicide"            "arson"               "rape_legacy"
## [4] "rape_revised"        "robbery"             "aggravated_assault"
## [7] "motor_vehicle_theft" "burglary"            "larceny"
```

```r
# rename the level by matching its label rather than its position
levels(fbi$type)[levels(fbi$type) == "motor_vehicle_theft"] <- "car_theft"
levels(fbi$type)
```

```
## [1] "homicide"           "arson"              "rape_legacy"
## [4] "rape_revised"       "robbery"            "aggravated_assault"
## [7] "car_theft"          "burglary"           "larceny"
```

---

# Read more on factors

- Wickham & Grolemund's [chapter on factors](http://r4ds.had.co.nz/factors.html) in *R for Data Science*
- Roger Peng: [*stringsAsFactors: An unauthorized biography*](https://simplystatistics.org/posts/2015-07-24-stringsasfactors-an-unauthorized-biography/)
- Thomas Lumley: [*stringsAsFactors = &lt;sigh&gt;*](http://notstatschat.tumblr.com/post/124987394001/stringsasfactors-sigh)
- The `forcats` package has a lot of additional functions that make working with factors easier.
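---

# forcats: a sketch

As a taste of the `forcats` functions mentioned on the previous slide, here is a minimal sketch of reordering and renaming levels. It assumes `forcats` is installed; `type` and `count` are made-up toy vectors standing in for `fbi$type` and `fbi$count`.

```r
library(forcats)

# Hypothetical toy data, not the fbi data
type  <- factor(c("arson", "larceny", "arson", "burglary", "larceny", "burglary"))
count <- c(300, 500, 400, 10, 700, 20)

# Order the levels by a summary of count within each level;
# fct_reorder() summarises with the median by default
by_median <- fct_reorder(type, count)
levels(by_median)  # "burglary" "arson" "larceny"

# Rename a level by name: new_label = "old_label"
renamed <- fct_recode(by_median, theft = "larceny")
levels(renamed)    # "burglary" "arson" "theft"
```

Note the difference from base `reorder`, which summarises with the average unless told otherwise. Renaming by label also avoids counting positions in `levels()`.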