dplyr exercises part 2

Summarize

Compute the mean arrival and departure delay overall, and per origin airport. What is the standard deviation of these variables? What is New York City’s busiest airport?
```
flights %>%
  summarize(mean(___, na.rm = ___))

flights %>%
  group_by(___) %>%
  summarize(___)

flights %>%
  count(___) %>%
  arrange(___)
```

Which carriers had the longest accumulated air time, excluding cancelled flights?

```
flights %>%
  group_by(___) %>%
  summarize(acc_air_time = sum(_____)) %>%
  ungroup()
```

Plot a bar chart of the accumulated air time per airline, with a suitable unit for the total time.

Hint: Use forcats::fct_inorder() to fix the ordering of a categorical variable before plotting.
```
total_airtime_by_carrier <-
  _____

total_airtime_by_carrier %>%
  arrange(acc_air_time) %>%
  mutate(carrier = forcats::fct_inorder(carrier)) %>%
  ggplot(aes(___)) +
    geom_col()
```
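
The ordering trick from the hint can be tried on a small made-up table first. This is a minimal sketch that assumes dplyr and forcats are installed; the `toy` tibble and its `label` and `value` columns are invented for illustration and are not part of `flights`.

```r
library(dplyr)

# Hypothetical toy data, not taken from flights
toy <- tibble(
  label = c("b", "c", "a"),
  value = c(20, 5, 12)
)

# By default, factor levels are sorted alphabetically
default_levels <- levels(factor(toy$label))

# After arrange(), fct_inorder() sets the levels in order of first appearance,
# so the levels follow the sorted values
ordered_toy <- toy %>%
  arrange(value) %>%
  mutate(label = forcats::fct_inorder(label))

ordered_levels <- levels(ordered_toy$label)
```

ggplot2 draws a discrete axis in level order, so a bar chart of `ordered_toy` would show the bars sorted by `value` rather than alphabetically.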

Which carriers specialize in long-distance routes?

```
flights %>%
  _____ %>%
  _____ %>%
  _____ %>%
  arrange(___)
```

Plot a bar chart of median distance per flight. Do you use geom_bar() or geom_col()? Why?

```
median_miles_by_carrier <-
  _____ %>%
  _____ %>%
  _____

median_miles_by_carrier %>%
  arrange(_____) %>%
  mutate(_____) %>%
  ggplot(aes(___)) +
    geom____()
```

Which plane had the most failed departure attempts? Can you find a solution without filter()?

Hint: Use the idiom sum(___) to count the rows where a predicate is true.
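
The counting idiom from the hint works because R treats `TRUE` as 1 and `FALSE` as 0. A minimal base R sketch on a made-up vector (the name `x` and its values are invented for illustration):

```r
# Hypothetical toy vector; NA marks a missing value
x <- c(4, NA, 7, NA, NA, 1)

# is.na(x) is a logical vector; sum() counts TRUE as 1 and FALSE as 0
n_missing <- sum(is.na(x))

# Any other predicate works the same way;
# na.rm = TRUE skips comparisons that evaluate to NA
n_large <- sum(x > 3, na.rm = TRUE)
```

Inside `summarize()`, the same expression is evaluated once per group, which gives a per-group count.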

```
flights %>%
  filter(is.na(dep_time)) %>%
  group_by(tailnum) %>%
  _____ %>%
  _____ %>%
  filter(!is.na(tailnum)) %>%
  arrange(desc(___)) %>%
  head(1)

# Alternative without filter():
flights %>%
  group_by(tailnum) %>%
  _____ %>%
  _____ %>%
  arrange(_____) %>%
  head(1)
```

Compute the ratio of short-distance routes (less than 300 miles) for each airline.

Hint: Use the idiom mean(___) to compute the share of rows where a predicate is true.
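
The share idiom from the hint follows from the same coercion of logical values to 0 and 1: the mean of a logical vector is the fraction of `TRUE` entries. A minimal base R sketch on a made-up vector (the name `scores` and the threshold are invented for illustration):

```r
# Hypothetical toy values, not taken from flights
scores <- c(40, 55, 70, 90)

# The predicate yields FALSE, TRUE, TRUE, TRUE;
# mean() returns the share of TRUE values
share_passed <- mean(scores >= 50)

# Equivalent long form: count of TRUE values divided by the number of rows
share_long_form <- sum(scores >= 50) / length(scores)
```

Both expressions give the same number; the `mean()` form is shorter and reads well inside `summarize()`.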
```
flights %>%
  group_by(carrier) %>%
  _____ %>% 
  ungroup()
```

Plot a bar chart of the ratio of short-distance routes.

```
short_distance_route_ratio <-
  _____

short_distance_route_ratio %>%
  ggplot(aes(___)) +
    geom_col()
```

Find more exercises in item 1 of Section 5.6.7 of r4ds.

Summarize with multiple variables

Which route (pair of origin and destination airports) is served by the largest number of distinct airlines? Find one solution using summarize(), one using count(), and one using tally(). Which is the most elegant?
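
The three verbs can be compared on a small made-up table before applying them to `flights`. This is a minimal sketch that assumes dplyr is installed; the `toy` tibble and its `group` and `item` columns are invented for illustration. It covers only a single counting step, not the two-stage counting the exercise asks for.

```r
library(dplyr)

# Hypothetical toy data, not taken from flights
toy <- tibble(
  group = c("A", "A", "B", "B", "B"),
  item  = c("x", "y", "x", "x", "y")
)

# Three ways to count the rows per group
via_summarize <- toy %>% group_by(group) %>% summarize(n = n())
via_count     <- toy %>% count(group)
via_tally     <- toy %>% group_by(group) %>% tally()
```

All three results hold one row per `group` with the row count in column `n`.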

```
flights %>%
  group_by(___, ___, carrier) %>%
  summarize(n = n()) %>%
  summarize(n_airlines = ___) %>%
  ungroup() %>%
  arrange(___) %>%
  head(1)

flights %>%
  count(_____) %>%
  count(_____) %>%
  _____ %>%
  _____

flights %>%
  group_by(_____) %>%
  tally() %>%
  tally(wt = NULL) %>%
  _____ %>%
  _____
```

Compute the share of cancelled flights per month per airline.

```
flights %>%
  group_by(_____) %>%
  summarize(share_of_cancelled = _____) %>%
  ungroup()
```

Create a heat map of cancelled flights.

```
cancelled_flights <-
  _____

cancelled_flights %>%
  ggplot() +
  geom_raster(
    aes(
      x = ___,
      y = factor(month),
      fill = ___
    )
  )
```

Find more exercises in Section 5.6.7 of r4ds.

Grouped mutate

Which month is busiest in terms of miles flown, over all carriers?

```
flights %>%
  group_by(___) %>%
  mutate(total_distance = sum(___)) %>%
  mutate(month_share = ___ / ___) %>%
  arrange(desc(month_share)) %>%
  slice(1)
```

Visualize with a bar chart.

Which month is busiest in terms of miles flown, per carrier?

Hint: Compute the share of yearly miles flown of each airline in each month.
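
A grouped `mutate()` keeps every row and evaluates aggregates such as `sum()` within each group, which is what makes a within-group share possible. This is a minimal sketch that assumes dplyr is installed; the `toy` tibble and its `team` and `points` columns are invented for illustration.

```r
library(dplyr)

# Hypothetical toy data, not taken from flights
toy <- tibble(
  team   = c("a", "a", "b", "b"),
  points = c(1, 3, 2, 2)
)

# sum(points) is computed per team, so each row is divided by its team total
shares <- toy %>%
  group_by(team) %>%
  mutate(share = points / sum(points)) %>%
  ungroup()
```

Unlike `summarize()`, the result still has one row per input row, and the shares within each team add up to 1.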

```
flights %>%
  group_by(___, ___) %>%
  summarize(total_distance_by_carrier = sum(distance)) %>%
  mutate(total_distance = sum(___)) %>%
  ungroup() %>%
  mutate(month_share_by_carrier = ___ / ___) %>%
  arrange(desc(month_share_by_carrier)) %>%
  group_by(___) %>%
  slice(1)
```

Draw a heat map of miles flown per month per airline to see if this pattern holds across all airlines.

```
monthly_shares <-
  _____

monthly_shares %>%
  ggplot(aes(factor(month), ___, fill = ___)) +
  geom_tile() +
  scale_fill_continuous(trans = "log10")
```

Find more exercises in Section 5.7.1 of r4ds.

Kirill Müller, cynkra GmbH
