Summarize

  1. Compute the mean arrival and departure delay overall, and per origin airport. What is the standard deviation of these variables? What is New York City’s busiest airport?

    flights %>%
      summarize(mean(___, na.rm = ___))
    
    flights %>%
      group_by(___) %>%
      summarize(___)
    
    flights %>%
      count(___) %>%
      arrange(___)
  2. Which carriers had the longest accumulated air time, excluding cancelled flights?

    flights %>%
      group_by(___) %>%
      summarize(acc_air_time = sum(_____)) %>% 
      ungroup()
  3. Plot a bar chart of the accumulated air time per airline, with a suitable unit for the total time.

    Hint: Use forcats::fct_inorder() to fix the ordering of a categorical variable before plotting.

    total_airtime_by_carrier <-
      _____
    
    total_airtime_by_carrier %>%
      arrange(acc_air_time) %>%
      mutate(carrier = forcats::fct_inorder(carrier)) %>%
      ggplot(aes(___)) +
        geom_col()
  4. Which carriers specialize in long-distance routes?

    flights %>%
      _____ %>%
      _____ %>%
      _____ %>%
      arrange(___)
  5. Plot a bar chart of median distance per flight. Do you use geom_bar() or geom_col()? Why?

    median_miles_by_carrier <-
      _____ %>%
      _____ %>%
      _____
    
    median_miles_by_carrier %>%
      arrange(_____) %>%
      mutate(_____) %>%
      ggplot(aes(___)) +
        geom____()
  6. Which plane had the most failed departure attempts? Can you find a solution without filter()?

    Hint: Use the idiom sum(___) to count the rows where a predicate is true.

    flights %>%
      filter(is.na(dep_time)) %>%
      group_by(tailnum) %>%
      _____ %>%
      _____ %>%
      filter(!is.na(tailnum)) %>%
      arrange(desc(___)) %>% 
      head(1)
    
    # Alternative without filter():
    flights %>%
      group_by(tailnum) %>%
      _____ %>%
      _____ %>%
      arrange(_____) %>%
      head(1)
  7. Compute the ratio of short-distance routes (less than 300 miles) for each airline.

    Hint: Use the idiom mean(___) to compute the share of rows where a predicate is true.

    flights %>%
      group_by(carrier) %>%
      _____ %>% 
      ungroup()
  8. Plot a bar chart of the ratio of short-distance routes.

    short_distance_route_ratio <-
      _____
    
    short_distance_route_ratio %>%
      ggplot(aes(___)) +
        geom_col()
  9. Find more exercises in item 1 of Section 5.6.7 of r4ds.
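
The hints in exercises 6 and 7 rely on R treating TRUE as 1 and FALSE as 0 in arithmetic: sum() of a logical vector counts the matching rows, and mean() gives their share. Below is a minimal sketch on a made-up tibble. The data and the names toy, n_missing_dep and share_short are illustrative assumptions, not part of nycflights13.

```r
library(dplyr)

# Hypothetical toy data (not taken from nycflights13)
toy <- tibble(
  carrier  = c("AA", "AA", "AA", "BB", "BB"),
  dep_time = c(700L, NA, 915L, NA, NA),
  distance = c(250, 1200, 280, 150, 900)
)

toy_summary <-
  toy %>%
  group_by(carrier) %>%
  summarize(
    # sum() of a logical vector: number of rows where the predicate is TRUE
    n_missing_dep = sum(is.na(dep_time)),
    # mean() of a logical vector: share of rows where the predicate is TRUE
    share_short = mean(distance < 300)
  ) %>%
  ungroup()

toy_summary
```

Because the predicate is evaluated inside summarize(), no prior filter() is needed, and groups without any matching row still appear with a count of 0.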

Summarize with multiple variables

  1. Which route (pair of origin and destination) is served by the largest number of distinct airlines? Find a solution using summarize(), one using count(), and one using tally(). Which is the most elegant?

    flights %>%
      group_by(___, ___, carrier) %>%
      summarize(n = n()) %>%
      summarize(n_airlines = ___) %>%
      ungroup() %>%
      arrange(___) %>%
      head(1)
    
    flights %>%
      count(_____) %>%
      count(_____) %>%
      _____ %>%
      _____
    
    flights %>%
      group_by(_____) %>%
      tally() %>%
      tally(wt = NULL) %>%
      _____ %>%
      _____
  2. Compute the share of cancelled flights per month per airline.

    flights %>% 
      group_by(_____) %>% 
      summarize(share_of_cancelled = _____) %>%
      ungroup()
  3. Create a heat map of cancelled flights.

    cancelled_flights <-
      _____
    
    cancelled_flights %>% 
      ggplot() +
      geom_raster(
        aes(
          x = ___,
          y = factor(month),
          fill = ___
        )
      )
  4. Find more exercises in Section 5.6.7 of r4ds.
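
The summarize() template in exercise 1 works because each summarize() call removes the last grouping variable, so a second summarize() operates on the remaining groups. The sketch below shows this on a made-up tibble with only two grouping levels. The data and the names toy, per_pair and per_origin are illustrative assumptions.

```r
library(dplyr)

# Hypothetical toy data (not taken from nycflights13)
toy <- tibble(
  origin  = c("X", "X", "X", "Y", "Y"),
  carrier = c("AA", "AA", "BB", "AA", "AA")
)

# First summarize(): one row per origin/carrier pair.
# The result is still grouped by origin (the last grouping variable is dropped).
per_pair <-
  toy %>%
  group_by(origin, carrier) %>%
  summarize(n = n())

group_vars(per_pair)

# Second summarize(): counts the rows of per_pair within each origin,
# i.e. the number of distinct carriers per origin.
per_origin <-
  per_pair %>%
  summarize(n_carriers = n()) %>%
  ungroup()

per_origin
```

Recent dplyr versions print a message about the remaining grouping after the first summarize(); this is informational and does not change the result.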

Grouped mutate

  1. Which month is busiest in terms of miles flown, over all carriers?

    flights %>%
      group_by(___) %>%
      mutate(total_distance = sum(___)) %>%
      mutate(month_share = ___ / ___) %>% 
      arrange(desc(month_share)) %>%
      slice(1)
  2. Visualize with a bar chart.

  3. Which month is busiest in terms of miles flown, per carrier?

    Hint: Compute the share of yearly miles flown of each airline in each month.

    flights %>%
      group_by(___, ___) %>%
      summarize(total_distance_by_carrier = sum(distance)) %>%
      mutate(total_distance = sum(___)) %>%
      ungroup() %>%
      mutate(month_share_by_carrier = ___ / ___) %>% 
      arrange(desc(month_share_by_carrier)) %>% 
      group_by(___) %>%
      slice(1)
  4. Draw a heat map of miles flown per month per airline to see if this pattern holds across all airlines.

    monthly_shares <-
      _____
    
    monthly_shares %>%
      ggplot(aes(factor(month), ___, fill = ___)) +
      geom_tile() +
      scale_fill_continuous(trans = "log10")
  5. Find more exercises in Section 5.7.1 of r4ds.
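
The exercises in this section depend on the difference between summarize() and mutate() on grouped data: summarize() collapses each group to one row, whereas mutate() keeps every row and makes group-level aggregates available next to the row values. Below is a minimal sketch on a made-up tibble. The data and the name shares are illustrative assumptions.

```r
library(dplyr)

# Hypothetical toy data (not taken from nycflights13)
toy <- tibble(
  carrier  = c("AA", "AA", "BB", "BB"),
  month    = c(1L, 2L, 1L, 2L),
  distance = c(100, 300, 50, 50)
)

shares <-
  toy %>%
  group_by(carrier) %>%
  mutate(
    # sum() is evaluated per carrier and repeated on every row of that carrier
    total_distance = sum(distance),
    # each row's share of its carrier's total
    month_share = distance / total_distance
  ) %>%
  ungroup()

shares
```

The result has as many rows as the input, and the shares add up to 1 within each carrier.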

Copyright © 2018 Kirill Müller. Licensed under CC BY-NC 4.0.