Filtering

  1. View the flights dataset in RStudio’s data pane. Look up the meaning of the variables in the help.

    Hint: You need to load the nycflights13 package.

    View(___)
  2. Find all flights that departed between 8:00 AM and 10:00 PM.

    flights %>%
      filter(between(dep_time, ___, ___))
  3. Find all flights that departed in the three winter months.

    flights %>%
      filter(month ___ c(___))
  4. Are there any flights where the departure time is later than the arrival time? What does this mean?

    flights %>%
      filter(_____)
  5. Find all flights that departed today 5 years ago.

    flights %>%
      filter(month ___, day ___)
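
As a warm-up for the blanks above, here is a minimal sketch of between() and %in% inside filter(). It assumes a made-up tibble called toy_flights (hypothetical, not part of nycflights13) and deliberately uses a different time window and different months than the exercises.

```r
library(dplyr)

# Made-up data for illustration only (not taken from nycflights13).
# dep_time uses the same HHMM integer encoding as flights, e.g. 1215 = 12:15 PM.
toy_flights <- tibble(
  month    = c(1, 6, 12, 7),
  dep_time = c(730, 1215, 2245, 1500)
)

# between(x, left, right) is inclusive on both ends.
afternoon <- toy_flights %>%
  filter(between(dep_time, 1200, 1800))

# %in% checks membership in a set of values.
summer <- toy_flights %>%
  filter(month %in% c(6, 7, 8))
```

Both results keep the rows with dep_time 1215 and 1500, which happen to be the June and July rows.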

Arrange

  1. On what day did the flight with the shortest airtime take place?

    Hint: Use head() to restrict your result to one row only.

    flights %>% 
      arrange(___) %>%
      head(1)
  2. Which flights had the heaviest delays? Can you use the tail() verb to obtain this information?

    flights %>% 
      arrange(___) %>%
      tail(1)
    
    flights %>% 
      arrange(desc(___)) %>%
      ___(1)
  3. On what day did the flight with the longest airtime take place?

    flights %>% 
      arrange(___ - ___) %>%
      head(1)
  4. Find two equivalent ways to select the six "UA" flights with the lowest delay. Which is faster? Why?

    Hint: RStudio has shortcuts for swapping the current line with the next or previous line.

    flights %>%
      filter(___) %>%
      arrange(___) %>%
      head(6)
    
    flights %>%
      arrange(___) %>%
      filter(___) %>%
      head(6)
  5. Which flights were best at recovering from a delay while in the air?
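
The head() versus tail() question depends on how arrange() treats missing values. The following sketch, on a made-up tibble toy_delays (hypothetical data), shows that missing values are sorted to the end in both ascending and descending order, so the two approaches need not agree.

```r
library(dplyr)

# Made-up data for illustration only; flight "B" has no recorded arrival delay.
toy_delays <- tibble(
  flight    = c("A", "B", "C", "D"),
  arr_delay = c(15, NA, -5, 120)
)

ascending  <- toy_delays %>% arrange(arr_delay)        # C, A, D, B
descending <- toy_delays %>% arrange(desc(arr_delay))  # D, A, C, B

# arrange() puts NA last in both cases, so these two rows differ:
last_ascending   <- ascending %>% tail(1)   # flight "B" (NA delay)
first_descending <- descending %>% head(1)  # flight "D" (120 minutes)
```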

Assignment, the pipe

  1. View all flights that arrived after 10:00 PM. Solve this in three ways: with an intermediate variable, with a nested expression, and with the pipe. Which appeals most to you?

    flights_after_10 <- filter(flights, ___)
    View(flights_after_10)
    View(filter(flights, ___))
    flights %>%
      filter(___) %>%
      View()
  2. Extend the three solutions to view all "UA" flights that arrived after 10:00 PM.

    flights_after_10 <- filter(flights, ___)
    ua_flights_after_10 <- ...
    View(___)
    View(filter(filter(flights, ___), ___))
    flights %>%
      filter(___) %>%
      filter(___) %>%
      View()
  3. Extend the three solutions to view all "UA" flights that departed before 6:00 AM and arrived after 10:00 PM.

  4. Extend the three solutions to view all "UA" flights that departed before 6:00 AM and arrived after 10:00 PM and had a delay of more than two hours.

  5. Extend the three solutions to view all "UA" flights that departed before 6:00 AM and arrived after 10:00 PM and had a delay of more than two hours, originating in one of New York City’s airports.

  6. Extend the three solutions to view all "UA" flights that departed before 6:00 AM and arrived after 10:00 PM and had a delay of more than two hours, originating in one of New York City’s airports but excluding Honolulu International Airport.

    Hint: Consult the airports dataset and use a filter with the predicate stringr::str_detect(name, "^Honolulu").

  7. Sort the result by distance.
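
To compare the three styles side by side, here is a minimal sketch on a made-up tibble toy_flights (hypothetical data, with conditions that differ from the exercises). All three variants compute the same result; they differ only in how the code reads.

```r
library(dplyr)

# Made-up data for illustration only.
toy_flights <- tibble(
  origin   = c("JFK", "LGA", "JFK", "EWR"),
  distance = c(2475, 733, 187, 2565)
)

# 1. Intermediate variables: every step gets a name.
long_haul     <- filter(toy_flights, distance > 1000)
long_haul_jfk <- filter(long_haul, origin == "JFK")

# 2. Nested expression: read from the inside out.
nested <- filter(filter(toy_flights, distance > 1000), origin == "JFK")

# 3. Pipe: read from top to bottom.
piped <- toy_flights %>%
  filter(distance > 1000) %>%
  filter(origin == "JFK")
```

Note how each additional condition adds one more named object in the first style, one more level of nesting in the second, and one more line in the third.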

Filtering and plotting

  1. Plot a histogram of the air time of all flights. Exclude Honolulu International Airport in Hawaii to get rid of the peak at the right-hand side. Zoom into the flights that have an air time between 400 and 500 minutes.

    Hint: Start with flights %>% ggplot() + ...

    flights %>% 
      ggplot(___) +
        ___()
    
    flights %>% 
      filter(___) %>%
      ggplot(___) +
        ___()
    
    flights %>% 
      filter(___) %>%
      filter(___) %>%
      ___
  2. Plot a heat map of all routes (combinations of origin and destination) with an air time shorter than one hour.

    Hint: Use geom_bin2d().

    flights %>% 
      filter(___) %>%
      ggplot(___) +
        ___()
  3. Think of other plots of the flights data that would not work if applied on the full dataset but are useful when applying a filter beforehand.
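
The general pattern of filtering first and plotting afterwards can be sketched as follows. The tibble toy_flights and the 60-minute cutoff are made up for illustration; the sketch assumes that ggplot2 is installed alongside dplyr.

```r
library(dplyr)
library(ggplot2)

# Made-up data for illustration only.
toy_flights <- tibble(
  air_time = c(35, 48, 52, 140, 330, 610)
)

# Filter first, so that the long flights do not stretch the x axis.
short_hops <- toy_flights %>%
  filter(air_time < 60)

# The plot only sees the filtered rows.
p <- ggplot(short_hops, aes(x = air_time)) +
  geom_histogram(binwidth = 10)
```

Printing p draws the histogram. Because the pipe binds more tightly than +, the filter step can also be piped straight into ggplot(), as in the templates above.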

Combining filters

  1. Look at the “Details” section in the help page for | with help("|") to understand predicate logic in R. (We need element-wise comparisons.)

  2. Find all flights that departed today x years ago, flown by "US". Two simple solutions exist. Which appeals more to you?

    flights %>%
      filter(___, ___) %>%
      filter(___)
    
    flights %>%
      filter(_____)
  3. Find all flights that departed before 6:00 AM or after 10:00 PM.

    flights %>%
      filter(___ | ___)
  4. Find all flights not flown by either "UA" or "WN". Can you think of three different solutions? Which appeals most to you?

    flights %>%
      filter(___ ___ ___)
    
    flights %>%
      filter(!(___) ___ !(___))
    
    flights %>%
      filter(!(_____))
  5. Which flights have a missing departure or arrival time? Which have both missing? Can the number of flights that have a missing arrival but not departure time correspond to lost or crashed flights?

    flights %>%
      filter(is.na(___))
    
    flights %>%
      filter(___(___) ___ _____)
    
    flights %>%
      filter(_____ ___ !_____)
  6. Find two equivalent ways to select the six "UA" flights with the lowest delay. Which is faster? Why?

    Hint: RStudio has shortcuts for swapping the current line with the next or previous line.

    flights %>%
      filter(___) %>%
      arrange(___) %>%
      head(6)
    
    flights %>%
      arrange(___) %>%
      filter(___) %>%
      head(6)
  7. Find more exercises in items 1 and 4 of Section 5.2.4, and in Section 5.3.1, of r4ds.
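
Two points from this section can be checked on a small example: negated conditions can be written in several equivalent ways, and filter() silently drops rows for which the condition evaluates to NA. The sketch below uses a made-up tibble toy_flights (hypothetical data) and columns other than those asked for in the exercises.

```r
library(dplyr)

# Made-up data for illustration only; one departure time is missing.
toy_flights <- tibble(
  origin   = c("JFK", "LGA", "EWR", "JFK"),
  dep_time = c(500, NA, 2300, 1200)
)

# Three equivalent ways to exclude two origins.
not_1 <- toy_flights %>% filter(origin != "JFK", origin != "LGA")  # comma means AND
not_2 <- toy_flights %>% filter(!(origin == "JFK" | origin == "LGA"))
not_3 <- toy_flights %>% filter(!origin %in% c("JFK", "LGA"))

# filter() keeps only rows where the condition is TRUE.
# The row with a missing dep_time yields NA and is dropped.
early_or_late <- toy_flights %>% filter(dep_time < 600 | dep_time > 2200)

# Missing values have to be requested explicitly.
missing_dep <- toy_flights %>% filter(is.na(dep_time))
```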

Select and rename

  1. Find three ways to select the first five variables from the flights dataset.

    flights %>% 
      select(___, ___, ________)
    
    flights %>% 
      select(___:___)
    
    flights %>% 
      select(___:___)
  2. Find three ways to exclude the date of the flight.

    flights %>% 
      select(___, ___, ______________________)
    
    flights %>% 
      select(-___, -___, -___)
    
    flights %>% 
      select(-___:-___)
  3. Select all variables related to departure.

    flights %>% 
      select(___, ___, _______)
    
    flights %>% 
      select(starts_with("___"))
  4. Move the variables related to scheduled time to the end of the table.

    flights %>% 
      select(-___, -___, _______, everything(), ___, ___)
  5. Create a contour plot of departure and arrival time. Rename the columns to show prettily in the plot. Restrict the plot to all flights that arrive before 5:00 AM.

    flights %>%
      select(`___` = ___, `___` = ___, ___) %>%
      filter(___) %>%
      ggplot(aes(x = `___`, y = `___`)) +
      geom_density2d()
  6. Find more exercises in Section 5.4.1 of r4ds.
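
The selection styles used in this section can be tried out on a one-row example. The tibble toy_flights below is made up (hypothetical data that borrows a few column names from flights), and the selected ranges differ from those in the exercises.

```r
library(dplyr)

# Made-up data for illustration only.
toy_flights <- tibble(
  year = 2013, month = 1, day = 1,
  dep_time = 517, dep_delay = 2,
  arr_time = 830, arr_delay = 11
)

by_name  <- toy_flights %>% select(dep_time:arr_delay)   # range of names
by_index <- toy_flights %>% select(4:7)                   # range of positions
dropped  <- toy_flights %>% select(-(dep_time:arr_delay)) # everything except a range
arrivals <- toy_flights %>% select(starts_with("arr"))    # match by prefix

# select() can rename while selecting; backticks allow names with spaces.
pretty <- toy_flights %>% select(`Arrival delay` = arr_delay, dep_delay)
```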

Mutate

  1. Store the speed for each flight as miles per hour in a new variable.

    flights %>% 
      mutate(miles_per_hour = distance ___ air_time ___ ___) %>%
      ggplot(aes(___)) +
      _____
  2. Can you use an intermediate variable to clarify the intent? How do you remove the intermediate variable?

    flights %>% 
      mutate(miles_per_minute = _____) %>% 
      mutate(miles_per_hour = _____) %>% 
      select(_____)
  3. Visualize the speed distribution as a histogram. Would this visualization work without involving mutate()?

    flights %>% 
      ______ %>%
      ggplot(aes(___)) +
      _____
    
    # Alternative:
    flights %>%
      ggplot(aes(___)) +
      _____
  4. Create a new logical variable that indicates if the flight arrived on time.

    flights %>%
      mutate(on_time = (___ <= 0))
  5. Visualize the aggregated on-time status per airline with a useful text.

    flights %>%
      mutate(
        on_time = _____,
        on_time_desc = if_else(___, "On time", ___)
      ) %>%
      ggplot(aes(___)) +
      geom_bar()
  6. Can you detect a difference in the speed distributions of on-time vs. delayed flights? Use color or faceting.

    speed_and_on_time_info <-
      _____
    
    speed_and_on_time_info %>%
      ggplot() +
      geom_freqpoly(
        aes(x = ___, y = after_stat(density), color = ___),
        na.rm = TRUE,
        binwidth = 20
      )
    
    speed_and_on_time_info %>%
      ggplot() +
      geom_histogram(
        aes(x = ___),
        na.rm = TRUE,
        binwidth = 20
      ) +
      facet_wrap(~___, ncol = 1)
  7. Create two new variables date_hour and date_ymd, using as.Date() and lubridate::make_date(), respectively. Are the two values the same for all observations? What happens if we omit the tz argument to as.Date()?

    flights %>%
      mutate(
        ___ = as.Date(___, tz = "EST"),
        ___ = lubridate::make_date(_____)
      ) %>% 
      filter(___)
  8. Visualize the deviation from the overall average departure delay for the three airports of New York City. Consider using a violin plot.

    flights %>%
      mutate(dep_delay_dev = ___ - mean(___)) %>%
      ggplot(aes(___)) +
      _____ +
      _____
  9. Find more exercises in Section 5.5.2 of r4ds.
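
The pattern of computing an intermediate variable, building on it, and dropping it again can be sketched as follows. The tibble toy_flights, the kilometer conversion, and the 1500 km threshold are made up for illustration and are not part of the exercises.

```r
library(dplyr)

# Made-up data for illustration only; distance is in miles.
toy_flights <- tibble(
  distance = c(187, 2475, 733)
)

result <- toy_flights %>%
  mutate(
    distance_km = distance * 1.609344,   # intermediate variable
    long_haul   = distance_km > 1500,    # can use the column defined just above
    haul_desc   = if_else(long_haul, "Long haul", "Short haul")
  ) %>%
  select(-distance_km)                   # drop the intermediate variable again
```

Within a single mutate() call, later expressions can refer to columns created earlier in the same call, so the two-step mutate() in item 2 can also be written as one.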

Copyright © 2018 Kirill Müller. Licensed under CC BY-NC 4.0.