Compute the mean arrival and departure delay overall, and per origin airport. What is the standard deviation of these variables? What is New York City’s busiest airport?
flights %>%
summarize(mean(___, na.rm = ___))
flights %>%
group_by(___) %>%
summarize(___)
flights %>%
count(___) %>%
arrange(___)
Which carriers had the longest accumulated air time, excluding cancelled flights?
flights %>%
group_by(___) %>%
summarize(acc_air_time = sum(_____)) %>%
ungroup()
Plot a bar chart of the accumulated air time per airline, with a suitable unit for the total time.
Hint: Use forcats::fct_inorder() to fix the ordering of a categorical variable before plotting.
total_airtime_by_carrier <-
_____
total_airtime_by_carrier %>%
arrange(acc_air_time) %>%
mutate(carrier = forcats::fct_inorder(carrier)) %>%
ggplot(aes(___)) +
geom_col()
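To illustrate the hint, here is a minimal sketch of what forcats::fct_inorder() does to factor levels. The vector `x` is a hypothetical toy example, not taken from the flights data.

```r
library(forcats)

# Hypothetical toy values, used only for illustration
x <- factor(c("b", "c", "a", "b"))

# By default, factor levels are sorted alphabetically
default_levels <- levels(x)

# fct_inorder() reorders the levels by first appearance in the data,
# which is why sorting the rows with arrange() first fixes the bar order
inorder_levels <- levels(fct_inorder(x))
```

Because ggplot2 draws the bars of a discrete axis in level order, sorting the rows first and then calling fct_inorder() yields bars sorted by value.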
Which carriers specialize in long-distance routes?
flights %>%
_____ %>%
_____ %>%
_____ %>%
arrange(___)
Plot a bar chart of the median distance per flight for each carrier. Do you use geom_bar() or geom_col()? Why?
median_miles_by_carrier <-
_____ %>%
_____ %>%
_____
median_miles_by_carrier %>%
arrange(_____) %>%
mutate(_____) %>%
ggplot(aes(___)) +
geom____()
Which plane had the most failed departure attempts? Can you find a solution without filter()?
Hint: Use the idiom sum(___) to count the rows where a predicate is true.
flights %>%
filter(is.na(dep_time)) %>%
group_by(tailnum) %>%
_____ %>%
_____ %>%
filter(!is.na(tailnum)) %>%
arrange(desc(___)) %>%
head(1)
# Alternative without filter():
flights %>%
group_by(tailnum) %>%
_____ %>%
_____ %>%
arrange(_____) %>%
head(1)
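The sum(___) idiom from the hint works because R treats TRUE as 1 and FALSE as 0 when summing a logical vector. A minimal sketch on a hypothetical toy table (the columns `id` and `value` and the threshold 10 are invented for illustration):

```r
library(dplyr)

# Hypothetical toy data, not taken from flights
toy <- tibble(
  id    = c("a", "a", "b", "b", "b"),
  value = c(5, 12, 20, 30, 1)
)

# sum() over a logical vector counts the TRUE entries,
# so no filter() step is needed before summarizing
large_by_id <- toy %>%
  group_by(id) %>%
  summarize(n_large = sum(value > 10)) %>%
  ungroup()
```

Unlike a filter() step, this keeps groups whose count is zero in the result.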
Compute the ratio of short-distance routes (less than 300 miles) for each airline.
Hint: Use the idiom mean(___) to compute the share of rows where a predicate is true.
flights %>%
group_by(carrier) %>%
_____ %>%
ungroup()
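The mean(___) idiom is the same trick as sum(___): the mean of a logical vector is the fraction of TRUE values. A minimal sketch on a hypothetical toy table (the columns `g` and `x` and the cutoff 5 are invented for illustration):

```r
library(dplyr)

# Hypothetical toy data, not taken from flights
toy <- tibble(
  g = c("a", "a", "a", "a", "b", "b"),
  x = c(1, 7, 8, 9, 3, 8)
)

# mean() over a logical vector gives the share of TRUE entries per group
share_small <- toy %>%
  group_by(g) %>%
  summarize(share = mean(x < 5)) %>%
  ungroup()
```

Note that a single NA in the predicate makes the whole share NA unless na.rm = TRUE is passed to mean().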
Plot a bar chart of the ratio of short-distance routes.
short_distance_route_ratio <-
_____
short_distance_route_ratio %>%
ggplot(aes(___)) +
geom_col()
Find more exercises in item 1 of Section 5.6.7 of r4ds.
Which route is serviced by the largest number of distinct airlines? Find a solution using summarize(), one using count(), and one using tally(). Which is the most elegant?
flights %>%
group_by(___, ___, carrier) %>%
summarize(n = n()) %>%
summarize(n_airlines = ___) %>%
ungroup() %>%
arrange(___) %>%
head(1)
flights %>%
count(_____) %>%
count(_____) %>%
_____ %>%
_____
flights %>%
group_by(_____) %>%
tally() %>%
tally(wt = NULL) %>%
_____ %>%
_____
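As a reminder of how the three verbs relate, the sketch below counts rows per group in three ways on a hypothetical one-column toy table (the column `g` is invented for illustration). It shows a single counting step only, not the two-step counting the exercise asks for.

```r
library(dplyr)

# Hypothetical toy data, not taken from flights
toy <- tibble(g = c("x", "x", "y"))

# Explicit grouping plus summarize(n = n())
via_summarize <- toy %>%
  group_by(g) %>%
  summarize(n = n())

# count() groups, counts, and ungroups in one call
via_count <- toy %>%
  count(g)

# tally() counts the rows of an already grouped table
via_tally <- toy %>%
  group_by(g) %>%
  tally()
```

All three produce one row per value of `g` with the row count in column `n`.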
Compute the share of cancelled flights per month per airline.
flights %>%
group_by(_____) %>%
summarize(share_of_cancelled = _____) %>%
ungroup()
Create a heat map of cancelled flights.
cancelled_flights <-
_____
cancelled_flights %>%
ggplot() +
geom_raster(
aes(
x = ___,
y = factor(month),
fill = ___
)
)
Find more exercises in Section 5.6.7 of r4ds.
Which month is busiest in terms of miles flown, over all carriers?
flights %>%
group_by(___) %>%
mutate(total_distance = sum(___)) %>%
mutate(month_share = ___ / ___) %>%
arrange(desc(month_share)) %>%
slice(1)
Visualize with a bar chart.
Which month is busiest in terms of miles flown, per carrier?
Hint: Compute the share of yearly miles flown of each airline in each month.
flights %>%
group_by(___, ___) %>%
summarize(total_distance_by_carrier = sum(distance)) %>%
mutate(total_distance = sum(___)) %>%
ungroup() %>%
mutate(month_share_by_carrier = ___ / ___) %>%
arrange(desc(month_share_by_carrier)) %>%
group_by(___) %>%
slice(1)
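The template above relies on summarize() removing only the last grouping variable, so that a following mutate() still operates within the remaining groups. A minimal sketch of that mechanic on a hypothetical toy table (the columns `g`, `h`, and `x` are invented for illustration; recent dplyr versions print a message about the remaining grouping):

```r
library(dplyr)

# Hypothetical toy data, not taken from flights
toy <- tibble(
  g = c("a", "a", "a", "b", "b"),
  h = c(1, 1, 2, 1, 2),
  x = c(1, 2, 3, 4, 6)
)

shares <- toy %>%
  group_by(g, h) %>%
  # After summarize(), the result is still grouped by g (h is dropped)
  summarize(s = sum(x)) %>%
  # sum(s) is therefore computed per value of g, not over the whole table
  mutate(total = sum(s)) %>%
  ungroup() %>%
  mutate(share = s / total)
```

Here `total` is 6 for every row of group "a" and 10 for every row of group "b", so `share` sums to 1 within each group.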
Draw a heat map of miles flown per month per airline to see if this pattern holds across all airlines.
monthly_shares <-
_____
monthly_shares %>%
ggplot(aes(factor(month), ___, fill = ___)) +
geom_tile() +
scale_fill_continuous(trans = "log10")
Find more exercises in Section 5.7.1 of r4ds.
Copyright © 2018 Kirill Müller. Licensed under CC BY-NC 4.0.