View the flights dataset in RStudio’s data pane. Look up the meaning of the variables in the help.
Hint: You need to load the nycflights13 package.
View(___)
Find all flights that departed between 8:00 AM and 10:00 PM.
filter(flights, between(dep_time, ___, ___))
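Note: according to the help page of flights, dep_time and arr_time are stored as integers in HHMM format (517 means 5:17 AM), so time comparisons work on plain numbers. The following is a minimal sketch on a made-up tibble; the name toy and its values are assumptions for illustration only and are not part of nycflights13. It shows that between() includes both boundaries.

```r
library(dplyr)

# Hypothetical toy data: times coded as HHMM integers, as in flights
toy <- tibble(
  id = 1:5,
  dep_time = c(517L, 900L, 1200L, 1730L, 2359L)
)

# between(x, left, right) is shorthand for x >= left & x <= right,
# so rows exactly on a boundary are kept
around_noon <- filter(toy, between(dep_time, 900, 1200))
around_noon$id
```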
Find all flights that departed in the three winter months.
filter(flights, month ___ c(___))
Are there any flights where departure time is later than arrival time? What does this mean?
filter(flights, ___)
Find all flights that departed today four years ago.
filter(flights, month ___, day ___)
Find more exercises in item 1 of Section 5.2.4 of r4ds.
View all flights that arrived after 10:00 PM. Use an intermediate variable, a nested expression, and the pipe. Which appeals more to you?
flights_after_10 <- filter(flights, ___)
View(flights_after_10)
View(filter(flights, ___))
flights %>%
  filter(___) %>%
  View()
Extend the four solutions to view all "UA" flights that arrived after 10:00 PM.
flights_after_10 <- filter(flights, ___)
ua_flights_after_10 <- ...
View(___)
View(filter(filter(flights, ___), ___))
flights %>%
  filter(___) %>%
  filter(___) %>%
  View()
Extend the four solutions to view all "UA" flights that departed before 6:00 AM and arrived after 10:00 PM.
Extend the four solutions to view all "UA" flights that departed before 6:00 AM and arrived after 10:00 PM and had a delay of more than two hours.
Extend the four solutions to view all "UA" flights that departed before 6:00 AM and arrived after 10:00 PM and had a delay of more than two hours, originating in one of New York City’s airports.
Extend the four solutions to view all "UA" flights that departed before 6:00 AM and arrived after 10:00 PM and had a delay of more than two hours, originating in one of New York City’s airports but excluding Honolulu International Airport.
Hint: Consult the airports dataset.
Apply more restrictions to the four solutions.
Find all flights that departed today four years ago, flown by "US". Two simple solutions exist. Which appeals more to you?
flights %>%
  filter(___, ___) %>%
  filter(___)
flights %>%
  filter(_____)
Find all flights that departed before 6:00 AM or after 10:00 PM.
flights %>%
  filter(___ | ___)
Find all flights not flown by either "UA" or "WN". Can you think of three different solutions? Which appeals most to you?
flights %>%
  filter(!___)
flights %>%
  filter(!(___) ___ !(___))
Which flights have a missing departure or arrival time? Which have both missing? Can the number of flights that have a missing arrival but not departure time correspond to lost or crashed flights?
flights %>%
  filter(is.na(___) ___)
flights %>%
  filter(is.na(___) ___)
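Note: missing values need is.na() because any comparison with NA yields NA, and filter() keeps only rows where the condition is TRUE. The following is a minimal sketch on a made-up tibble; the name toy and its values are assumptions for illustration and do not come from flights.

```r
library(dplyr)

# Hypothetical toy data with missing times
toy <- tibble(
  id = 1:4,
  dep_time = c(600L, NA, 700L, NA),
  arr_time = c(800L, NA, NA, 900L)
)

# dep_time == NA evaluates to NA for every row, so no row is kept
never_true <- filter(toy, dep_time == NA)
nrow(never_true)

# is.na() returns TRUE or FALSE and therefore works as a filter condition
missing_dep <- filter(toy, is.na(dep_time))
missing_dep$id
```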
Find more exercises in item 4 of Section 5.2.4 of r4ds.
Plot a histogram of the air time of all flights. Exclude Honolulu International Airport in Hawaii to get rid of the peak at the right-hand side. Zoom into the flights that have an air time between 400 and 500 minutes.
Hint: Start with flights %>% ggplot() + ...
flights %>%
  ggplot(___) +
  ___()
flights %>%
  filter(___) %>%
  ggplot(___) +
  ___()
flights %>%
  filter(___) %>%
  filter(___) %>%
  ___
Plot a heat map for all routes (pairs of origin and destination) with an air time shorter than one hour.
Hint: Use geom_bin2d().
flights %>%
  filter(___) %>%
  ggplot(___) +
  ___()
Think of other plots of the flights data that would not work if applied to the full dataset but are useful when applying a filter beforehand.
Find three ways to select the first five variables from the flights dataset.
flights %>%
  select(___, ___, ________)
flights %>%
  select(___:___)
flights %>%
  select(___:___)
Find three ways to exclude the date of the flight.
flights %>%
  select(___, ___, ______________________)
flights %>%
  select(-___, -___, -___)
flights %>%
  select(-___:-___)
Select all variables related to departure.
flights %>%
  select(___, ___, _______)
flights %>%
  select(starts_with("___"))
Move the variables related to scheduled time to the end of the table.
flights %>%
  select(-___, -___, _______, everything(), ___, ___)
Create a contour plot of departure and arrival time. Rename the columns to show prettily in the plot. Restrict the plot to all flights that arrive before 5:00 AM.
flights %>%
  select(`___` = ___, `___` = ___, ___) %>%
  filter(___) %>%
  ggplot(aes(x = `___`, y = `___`)) +
  geom_density2d()
Find more exercises in Section 5.4.1 of r4ds.
On what day did the flight with the shortest air time take place?
Hint: Use head() to restrict your result to one row only.
flights %>%
  arrange(___) %>%
  head(1)
Which flights had the heaviest delays? Can you use the tail() verb to obtain this information?
flights %>%
  arrange(___) %>%
  tail(1)
flights %>%
  arrange(desc(___)) %>%
  ___(1)
On what day did the flight with the longest air time take place?
flights %>%
  arrange(___ - ___) %>%
  tail(1)
Find two equivalent ways to select the six "UA" flights with the lowest delay. Which is faster? Why?
Hint: RStudio has shortcuts for swapping the current line with the next or previous line.
flights %>%
  filter(___) %>%
  arrange(___)
flights %>%
  arrange(___) %>%
  filter(___)
Which flights were best at recovering from a delay while in the air?
Find more exercises in Section 5.3.1 of r4ds.
Store the speed for each flight as miles per hour in a new variable. Visualize the speed distribution as a histogram. Would this visualization work without involving mutate()?
flights %>%
  mutate(speed = ___) %>%
  ggplot(aes(___)) +
  _____
flights %>%
  ggplot(aes(___)) +
  _____
Can you detect a difference in the speed distributions of on-time vs. delayed flights? Create a new variable that displays nicely in the legend or in the facet.
flights %>%
  mutate(on_time = if_else(___ < 0, "On time", "Delayed")) %>%
  ggplot(aes(___)) +
  _____
Visualize the deviation from the overall average departure delay for the three airports of New York City. Consider using a violin plot.
flights %>%
  mutate(dep_delay_dev = ___ - mean(___)) %>%
  ggplot(aes(___)) +
  _____ +
  _____
Find more exercises in Section 5.5.2 of r4ds.
Compute the mean arrival and departure delay overall, and per origin airport. What is the standard deviation of these variables? What is New York City’s busiest airport?
flights %>%
  summarize(mean(___, na.rm = ___))
flights %>%
  group_by(___) %>%
  summarize(___)
flights %>%
  count(___) %>%
  arrange(___)
Which carriers had the longest accumulated air time, excluding cancelled flights? Plot a bar chart with a suitable unit for the total time.
Hint: Use forcats::fct_inorder() to fix the ordering of a categorical variable before plotting.
total_airtime_by_carrier <-
  flights %>%
  group_by(___) %>%
  summarize(acc_air_time = sum(_____))
total_airtime_by_carrier %>%
  arrange(acc_air_time) %>%
  mutate(carrier = forcats::fct_inorder(carrier)) %>%
  ggplot(aes(___)) +
  geom_bar()
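Note: ggplot2 orders a discrete axis by factor levels, and a character column is ordered alphabetically by default. The following is a minimal sketch of what the arrange() plus fct_inorder() combination does, using a made-up tibble; the name toy, the carrier codes, and the totals are assumptions for illustration only.

```r
library(dplyr)

# Hypothetical toy data: one total per carrier, in arbitrary order
toy <- tibble(
  carrier = c("BB", "AA", "CC"),
  total = c(30, 10, 20)
)

# Default factor levels are sorted alphabetically
levels(factor(toy$carrier))

# After sorting the rows, fct_inorder() fixes the levels in row order,
# so a plot would show the carriers sorted by total
ordered <- toy %>%
  arrange(total) %>%
  mutate(carrier = forcats::fct_inorder(carrier))

levels(ordered$carrier)
```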
Which carriers specialize in long-distance routes? Plot a bar chart similar to the previous exercise.
total_miles_by_carrier <-
  _____ %>%
  _____ %>%
  _____
total_miles_by_carrier %>%
  arrange(_____) %>%
  mutate(_____) %>%
  ggplot(aes(___)) +
  geom_bar()
Which plane had the most failed departure attempts? Can you find a solution without filter()?
Hint: Use the idiom sum(___) to count the rows where a predicate is true.
flights %>%
  filter(is.na(dep_time)) %>%
  group_by(tailnum) %>%
  _____ %>%
  _____ %>%
  head(1)
flights %>%
  group_by(tailnum) %>%
  _____ %>%
  _____ %>%
  head(1)
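Note: the sum() idiom from the hint works because R treats TRUE as 1 and FALSE as 0 when summing a logical vector. The following is a minimal sketch on a made-up tibble; the names toy, group, value, and n_missing are assumptions for illustration and are unrelated to the columns of flights.

```r
library(dplyr)

# Hypothetical toy data with some missing values per group
toy <- tibble(
  group = c("a", "a", "a", "b", "b"),
  value = c(1, NA, NA, 4, NA)
)

# sum() over a logical vector counts the TRUE entries,
# here the number of missing values within each group
counts <- toy %>%
  group_by(group) %>%
  summarize(n_missing = sum(is.na(value)))

counts
```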
Compute the ratio of short-distance routes (less than 300 miles) for each airline. Plot a bar chart.
Hint: Use the idiom mean(___) to compute the share of rows where a predicate is true.
flights %>%
  group_by(carrier) %>%
  _____ %>%
  ggplot(aes(___)) +
  geom_col()
Find more exercises in item 1 of Section 5.6.7 of r4ds.
Which route (pair of origin and destination) is serviced by the largest number of distinct airlines? Find a solution using summarize(), one using count(), and one using tally(). Which is most elegant?
flights %>%
  group_by(___, ___, carrier) %>%
  summarize(n = n()) %>%
  summarize(n_airlines = ___) %>%
  ungroup() %>%
  arrange(___) %>%
  head(1)
flights %>%
  count(_____) %>%
  count(_____) %>%
  _____ %>%
  _____
flights %>%
  group_by(_____) %>%
  tally() %>%
  tally(wt = NULL) %>%
  _____ %>%
  _____
Create a heat map for the share of cancelled flights per month per airline.
flights %>%
  group_by(_____) %>%
  summarize(share_of_cancelled = _____) %>%
  ungroup() %>%
  ggplot() +
  geom_raster(
    aes(
      x = ___,
      y = factor(month),
      fill = ___
    )
  )
Find more exercises in Section 5.6.7 of r4ds.
Copyright © 2018 Kirill Müller. Licensed under CC BY-NC 4.0.