Lab: Ames Housing Data Exploration

Overview

In this lab, you will explore the ames dataset from the classdata package, which contains data on residential sales in Ames, Iowa, since 2017. All work should be completed in this .Rmd file and submitted through Canvas.

The main variable of interest is Sale_Price. You will:

  • Investigate this variable and its distribution
  • Identify and explore one other variable that may relate to Sale_Price
  • Use code, visualizations, and written explanation to communicate your findings

Make sure your .Rmd file knits cleanly to .html. You will submit both the .Rmd and the knitted output.

Step 1: Data Exploration

  1. Inspect the first few lines of the data set:
    • What variables are there?
    • What type is each variable?
    • What does each variable mean?
    • What do you expect their data ranges to be?
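
One possible way to start this inspection is sketched below. It assumes the classdata package is already installed and provides a data frame named ames, as the overview states; the three functions used are base R.

```r
# Assumption: classdata is installed and provides the ames data frame.
library(classdata)

head(ames)      # first six rows
str(ames)       # name and type of every variable
summary(ames)   # per-variable summaries, including min and max for numeric columns
```

Comparing the output of summary() with the ranges you expected is a quick way to spot surprising values.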

answer

  2. Begin the exploration with the main variable, Sale_Price:
    • What is the range of this variable?
    • Create a histogram to visualize the distribution.
    • What is the general pattern? Are there any unusual or extreme values?
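
A minimal sketch of the range and histogram steps follows. It assumes classdata and ggplot2 are installed and that the column is named Sale_Price exactly as written in this lab (check names(ames) if the code fails). The bin count of 50 is an arbitrary starting point, not a recommendation.

```r
# Assumption: classdata is installed and ames has a column named Sale_Price.
library(classdata)
library(ggplot2)

# Smallest and largest sale price, ignoring missing values.
price_range <- range(ames$Sale_Price, na.rm = TRUE)
price_range

# Histogram of sale prices; try other bin counts to see how the shape changes.
price_hist <- ggplot(ames, aes(x = Sale_Price)) +
  geom_histogram(bins = 50)
price_hist
```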

answer

  3. Choose a second variable that may relate to Sale_Price:
    • What is the range of this variable? Plot it and describe the pattern.
    • Explore the relationship between this variable and Sale_Price. Use a scatterplot, boxplot, or faceted bar chart, whichever is most appropriate.
    • Describe the pattern. Does this second variable help explain anything you observed in Sale_Price?

answer

Homework Assignment

Chick Weights

The ChickWeight data set is part of the base package datasets. See ?ChickWeight for details on the data. For all of the questions, use dplyr functions whenever possible.

  1. Download the RMarkdown file with these homework instructions to use as a template for your work. Make sure to replace “Your Name” in the YAML with your name. To get full points, show your R code (in a code chunk) and write answers to the questions.

  2. What variables are part of the dataset?

answer

  3. For each diet, compute the number of chicks, their average weight, and the standard deviation of their weights at the start of the study (day 0).
    • Extra credit: construct a ggplot that shows the average weight by diet, with an interval (drawn as a line) of ± one standard deviation around each average. (Hint: read this article regarding ggplot error bars)
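
The summary and the error-bar plot could be approached along the following lines. This is only a sketch: the names diet_start, n_chicks, mean_weight, and sd_weight are made up for illustration, and geom_errorbar() is one of several ggplot2 geoms that can draw such an interval.

```r
library(dplyr)
library(ggplot2)

# One row per diet, using only the measurements taken at Time == 0.
diet_start <- ChickWeight %>%
  filter(Time == 0) %>%
  group_by(Diet) %>%
  summarise(
    n_chicks = n(),
    mean_weight = mean(weight),
    sd_weight = sd(weight)
  )
diet_start

# Mean weight per diet with a vertical line spanning mean +/- one sd.
diet_plot <- ggplot(diet_start, aes(x = Diet, y = mean_weight)) +
  geom_point() +
  geom_errorbar(
    aes(ymin = mean_weight - sd_weight, ymax = mean_weight + sd_weight),
    width = 0.2
  )
diet_plot
```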

answer

  4. Each chick should have twelve weight measurements. Use the dplyr package to determine how many chicks have a complete set of weight measurements and how many measurements the incomplete cases have on average. Extract the subset of the data for all chicks with complete information and name the data set complete. (Hint: you might want to use mutate to introduce a helper variable containing the number of observations per chick.)
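
The helper-variable idea from the hint might look like the sketch below. The names with_counts, n_obs, per_chick, n_complete, and mean_incomplete are illustrative choices, and the .by argument assumes dplyr 1.1.0 or later.

```r
library(dplyr)

# Attach the number of measurements for each chick to every row.
with_counts <- ChickWeight %>%
  mutate(n_obs = n(), .by = Chick)

# One row per chick, so chicks are not counted once per measurement.
per_chick <- with_counts %>%
  distinct(Chick, n_obs)

# Number of chicks with all twelve measurements.
n_complete <- sum(per_chick$n_obs == 12)

# Average number of measurements among the incomplete chicks.
mean_incomplete <- mean(per_chick$n_obs[per_chick$n_obs < 12])

# Keep only the rows belonging to complete chicks.
complete <- with_counts %>%
  filter(n_obs == 12)
```

Collapsing to one row per chick before counting matters: counting rows of with_counts directly would weight each chick by its number of measurements.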

answer

  5. In the complete data set, introduce a new variable that measures the difference between the current weight and the weight at day 0. Name this variable weightgain. (Hint: use mutate, and see ?mutate for the .by argument. This argument creates a temporary grouping, like group_by, so that the calculation is done within each subgroup, i.e. for each combination of chick and diet: weight - weight[Time == 0]. Note that weight - min(weight) gives the same result only if no chick ever drops below its day-0 weight.)
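
As a rough illustration of the .by hint, the sketch below first rebuilds a complete-case subset so that it runs on its own (the helper column n_obs is an illustrative name), then subtracts each chick's day-0 weight. It assumes dplyr 1.1.0 or later.

```r
library(dplyr)

# Rebuild the complete-case subset so this sketch is self-contained.
complete <- ChickWeight %>%
  mutate(n_obs = n(), .by = Chick) %>%
  filter(n_obs == 12)

# .by groups the data only for this one mutate call.
# Each chick has exactly one Time == 0 row, so weight[Time == 0]
# is a single value that is recycled across the chick's rows.
complete <- complete %>%
  mutate(weightgain = weight - weight[Time == 0], .by = c(Chick, Diet))
```

Unlike group_by, .by leaves no grouping behind, so no ungroup() call is needed afterwards.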

answer

  6. Use ggplot2 to create side-by-side boxplots of weightgain by Diet for day 21. Change the order of the categories in the Diet variable so that the boxplots are ordered by median weightgain. Describe the relationship in 2-3 sentences.
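
One way to order the boxplots is base R's reorder(), sketched below; forcats::fct_reorder() would be an alternative. The sketch recomputes weightgain so that it runs on its own, and the names day21 and gain_plot are illustrative.

```r
library(dplyr)
library(ggplot2)

day21 <- ChickWeight %>%
  # Keep chicks with all twelve measurements, then compute the gain.
  mutate(n_obs = n(), .by = Chick) %>%
  filter(n_obs == 12) %>%
  mutate(weightgain = weight - weight[Time == 0], .by = Chick) %>%
  # Restrict to the final day and sort the Diet levels by median gain.
  filter(Time == 21) %>%
  mutate(Diet = reorder(Diet, weightgain, FUN = median))

gain_plot <- ggplot(day21, aes(x = Diet, y = weightgain)) +
  geom_boxplot()
gain_plot
```

The reordering is done after filtering to day 21 so that the medians used for sorting are the same ones the boxplots display.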

answer

Note: Your submission must be fully reproducible, i.e. the TA and I will ‘knit’ your submission in RStudio.

For the submission: submit your solution as an R Markdown file and, as a backup, submit the corresponding HTML (or Word) file with it.

(Optional but encouraged):
If you’d like to practice using GitHub, feel free to push your .Rmd and knitted .html file to a public GitHub repository under your own account. If you do, paste the link to your GitHub repo below:

GitHub repo link (optional): __________