Work on the questions in R. Keep a copy of your R code: you will be asked to submit this script at the end.

Data on all flights in and out of Des Moines (DSM) for October 2008 are available at
See for a description of the variables.

Update: the link to the CSV file above is no longer valid. Please use the following code to load the data:

flights <- read.csv("")
  1. Load the flights data into R.
    Determine which flight was delayed the most on arrival. Report its row number, where it started, and by how much it was delayed on departure.

flights <- read.csv("")

# Which flight had the longest arrival delay?
# Where did it start, and was it already delayed on departure?
which.max(flights$ArrDelay)
## [1] 1516

flights[which.max(flights$ArrDelay), c("Origin", "DepDelay")]
##      Origin DepDelay
## 1516    DSM      614
  1. Bring the variable ‘Day’ into the correct order, starting with ‘Monday’.
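summary(factor(flights$Day))
##    Friday    Monday  Saturday    Sunday  Thursday   Tuesday Wednesday 
##       456       368       248       336       460       360       459 

# Since R 4.0, read.csv() keeps strings as character, so levels(flights$Day)
# is NULL. Naming the levels explicitly works for character and factor alike.
days <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")
flights$Day <- factor(flights$Day, levels = days)

summary(flights$Day)
##    Monday   Tuesday Wednesday  Thursday    Friday  Saturday    Sunday 
##       368       360       459       460       456       248       336 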
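Setting factor levels by name rather than by position is less fragile, because it does not depend on the alphabetical position of each day or on whether the column was read as a factor. A minimal sketch on made-up data (the vectors `day` and `week` are hypothetical and not taken from the flights file):

```r
# Hypothetical toy data, not from the flights file
day <- c("Friday", "Monday", "Sunday", "Monday", "Wednesday")

# Desired order of the levels, starting with Monday
week <- c("Monday", "Tuesday", "Wednesday", "Thursday",
          "Friday", "Saturday", "Sunday")

# Naming the levels explicitly works for character and factor input alike
day_ordered <- factor(day, levels = week)

levels(day_ordered)  # Monday first, Sunday last
table(day_ordered)   # counts are reported in level order
```

Days that do not occur in the data (here Tuesday, Thursday, and Saturday) still appear in the table with a count of zero.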
  1. Create a new variable called ‘Weekend’ which has value TRUE for Saturdays and Sundays and FALSE otherwise.
# create new variable Weekend
flights$Weekend <- flights$Day %in% c("Saturday", "Sunday")

summary(flights$Weekend)
##    Mode   FALSE    TRUE 
## logical    2103     584 
  1. Determine how many flights arrived in Des Moines on average each day of the week.
# idea 1:
table(subset(flights, Dest=="DSM")$Day) # overall number of flights by day of week
##    Monday   Tuesday Wednesday  Thursday    Friday  Saturday    Sunday 
##       184       180       230       230       228       124       168 
# problem: how many Mondays, Tuesdays, etc. are there in October 2008?

library(lubridate)
octs <- data.frame(date = ymd(paste("2008/10/", 1:31, sep = "")))
octs$day <- wday(octs$date, label = TRUE)
table(octs$day)
## Sun Mon Tue Wed Thu Fri Sat 
##   4   4   4   5   5   5   4
# divide by the number of Mondays, Tuesdays, ..., Sundays (Day starts with Monday)
table(subset(flights, Dest == "DSM")$Day) / c(4, 4, 5, 5, 5, 4, 4)
# idea 2: 
library(dplyr)
# count arrivals per calendar day, then average those counts by day of the week
flights %>%
  filter(Dest == "DSM") %>%
  group_by(DayofMonth) %>%
  summarise(day = Day[1], n = n()) %>%
  group_by(day) %>%
  summarise(avg = mean(n), n = n())
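The weekday counts for October 2008 used in idea 1 can also be cross-checked in base R. This is a minimal sketch, not part of the original solution; the object names `dates`, `week`, and `counts` are made up for illustration. It relies on `format(x, "%u")`, which returns the ISO weekday number (1 = Monday, ..., 7 = Sunday) and therefore does not depend on the locale:

```r
# All 31 days of October 2008
dates <- seq(as.Date("2008-10-01"), as.Date("2008-10-31"), by = "day")

week <- c("Monday", "Tuesday", "Wednesday", "Thursday",
          "Friday", "Saturday", "Sunday")

# Count how often each ISO weekday number ("1" = Monday, ..., "7" = Sunday)
# occurs, then label the counts in Monday-to-Sunday order
counts <- setNames(as.vector(table(format(dates, "%u"))), week)
counts
```

Dividing the number of arrivals per weekday by these counts gives the average number of arrivals per day of the week.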
  1. How many flights were scheduled to go to Denver (DEN)? What percentage of flights leaving Des Moines go to Denver?
nrow(subset(flights, Dest=='DEN'))
## [1] 145
nrow(subset(flights, Dest=='DEN'))/nrow(subset(flights, Dest != 'DSM'))*100
## [1] 10.79672
  1. Where do most flights arriving in Des Moines come from? (use IATA code)

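# restrict to arrivals so that DSM itself does not show up as an origin
sort(table(subset(flights, Dest == "DSM")$Origin), decreasing = TRUE)[1]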
## ORD 
## 379
  1. Plot boxplots of arrival delays by originating airports. Order boxplots according to increasing median arrival delay.

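library(ggplot2)
# order by median arrival delay (reorder() uses the mean unless FUN is given)
flights %>% filter(Dest == "DSM") %>%
  ggplot(aes(x = reorder(Origin, ArrDelay, FUN = median, na.rm = TRUE), y = ArrDelay)) + geom_boxplot()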
## Warning: Removed 8 rows containing non-finite values (`stat_boxplot()`).
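`reorder()` sorts levels by the mean of its second argument unless `FUN` is supplied, so ordering by median requires `FUN = median`. The two can disagree when delays are skewed. A small sketch on made-up data (the airport codes `AAA` and `BBB` and their delays are hypothetical):

```r
# Hypothetical delays for two made-up airports
origin <- c("AAA", "AAA", "AAA", "BBB", "BBB", "BBB")
delay  <- c(0, 0, 90, 10, 10, 10)

# Mean delay:   AAA = 30, BBB = 10  -> BBB sorts first
by_mean <- levels(reorder(factor(origin), delay))

# Median delay: AAA = 0,  BBB = 10  -> AAA sorts first
by_median <- levels(reorder(factor(origin), delay, FUN = median))

by_mean
by_median
```

A single large delay pulls the mean of `AAA` above that of `BBB`, while its median stays at zero.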

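  1. Using dplyr, determine the following for flights leaving DSM, for each scheduled hour of departure: the number of flights, the percentage of flights delayed by more than 15 minutes, the average departure delay, and the three most frequent destinations.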

Draw a scatterplot of average departure delay by scheduled hour of departure. Color points by top destination, adjust point size to reflect the number of flights for each hour.

dep.summary <- flights %>%
  filter(Origin == "DSM") %>%
  mutate(hour = CRSDepTime %/% 100) %>%  # scheduled departure hour, 0-23
  group_by(hour) %>%
  summarise(
    count = n(),
    pct.delayed = sum(DepDelay > 15, na.rm = TRUE) / n() * 100,
    avg.delay = mean(DepDelay, na.rm = TRUE),
    top.Dest.1 = names(sort(table(Dest), decreasing = TRUE))[1],
    top.Dest.2 = names(sort(table(Dest), decreasing = TRUE))[2],
    top.Dest.3 = names(sort(table(Dest), decreasing = TRUE))[3]
  )

dep.summary %>% 
  ggplot(aes(x = hour, avg.delay, colour = top.Dest.1, size = count)) + geom_point()
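The `names(sort(table(x), decreasing = TRUE))[k]` idiom used for the top destinations can be tried on a small vector. The data below is made up for illustration. Asking for a rank beyond the number of distinct values returns `NA`, which is what an hour with fewer than three destinations would show:

```r
# Hypothetical destinations for a single hour (not from the flights file)
dest <- c("ORD", "DEN", "ORD", "MSP", "ORD", "DEN")

# Destinations ranked from most to least frequent: ORD (3), DEN (2), MSP (1)
ranked <- names(sort(table(dest), decreasing = TRUE))

ranked[1]  # most frequent destination
ranked[4]  # NA, because only three distinct destinations exist
```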