View the flights dataset in RStudio’s data pane. Look up the meaning of the variables in the help.
Hint: You need to load the nycflights13 package.
View(___)
Find all flights that departed between 8:00 AM and 10:00 PM.
flights %>%
filter(between(dep_time, ___, ___))
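A minimal sketch of how between() behaves, on a hypothetical toy table (the names toy, id and time are made up for illustration). The times use the same integer HHMM encoding as the flights data, so 945 stands for 9:45 AM.

```r
library(dplyr)

# Hypothetical toy data; `time` is an integer in HHMM format.
toy <- tibble(
  id = 1:5,
  time = c(530L, 600L, 945L, 1200L, 1830L)
)

# between(x, left, right) is shorthand for x >= left & x <= right,
# so both boundaries are included in the result.
toy_mid <- toy %>%
  filter(between(time, 600, 1200))

toy_mid$id
```

Because both ends are inclusive, the rows at exactly 600 and 1200 are kept.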
Find all flights that departed in the three winter months.
flights %>%
filter(month ___ c(___))
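As an illustration of matching a column against a set of values, here is a small sketch on made-up data (toy and its columns are hypothetical). It assumes that set membership is what the exercise is after.

```r
library(dplyr)

# Hypothetical toy data.
toy <- tibble(
  id = 1:6,
  m  = c(1L, 4L, 7L, 10L, 4L, 12L)
)

# x %in% set is TRUE for every element of x that occurs in `set`.
# Unlike ==, it never recycles `set` against x element by element.
picked <- toy %>%
  filter(m %in% c(4, 10))

picked$id
```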
Are there any flights where departure time is later than arrival time? What does this mean?
flights %>%
filter(_____)
Find all flights that departed today 5 years ago.
flights %>%
filter(month ___, day ___)
On what day did the flight with the shortest airtime take place?
Hint: Use head() to restrict your result to one row only.
flights %>%
arrange(___) %>%
head(1)
Which flights had the heaviest delays? Can you use the tail() verb to obtain this information?
flights %>%
arrange(___) %>%
tail(1)
flights %>%
arrange(desc(___)) %>%
___(1)
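When comparing the two approaches, it may help to see how arrange() treats missing values. The sketch below uses a hypothetical toy table; arrange() places NA last, even inside desc().

```r
library(dplyr)

# Hypothetical toy data with one missing value.
toy <- tibble(
  id    = 1:4,
  delay = c(5, NA, 42, -3)
)

# Ascending sort: the NA row ends up last, so tail(1) returns it.
asc_last <- toy %>%
  arrange(delay) %>%
  tail(1)

# Descending sort: NA is still last, so head(1) returns the largest value.
desc_first <- toy %>%
  arrange(desc(delay)) %>%
  head(1)

asc_last
desc_first
```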
On what day did the flight with the longest airtime take place?
flights %>%
arrange(___ - ___) %>%
head(1)
Find two equivalent ways to select the six "UA" flights with the lowest delay. Which is faster? Why?
Hint: RStudio has shortcuts for swapping the current line with the next or previous line.
flights %>%
filter(___) %>%
arrange(___)
flights %>%
arrange(___) %>%
filter(___)
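The equivalence of the two orderings can be checked on a small hypothetical table (toy, grp and delay are illustrative names). This sketch only shows that the results agree; it does not measure speed.

```r
library(dplyr)

# Hypothetical toy data.
toy <- tibble(
  id    = 1:6,
  grp   = c("AA", "BB", "AA", "BB", "AA", "BB"),
  delay = c(3, 9, -2, 0, 7, -5)
)

# Variant 1: drop rows first, then sort the smaller table.
a <- toy %>%
  filter(grp == "BB") %>%
  arrange(delay)

# Variant 2: sort the full table, then drop rows.
b <- toy %>%
  arrange(delay) %>%
  filter(grp == "BB")

identical(a$id, b$id)
```

On real data, wrapping each variant in system.time() is one way to compare how long the two take.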
Which flights were best at recovering from a delay while in the air?
View all flights that arrived after 10:00 PM. Use an intermediate variable, a nested expression, and the pipe. Which appeals more to you?
flights_after_10 <- filter(flights, ___)
View(flights_after_10)
View(filter(flights, ___))
flights %>%
filter(___) %>%
View()
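The three styles can be compared side by side on a hypothetical toy table. This sketch uses nrow() in place of View() so that it also runs outside RStudio.

```r
library(dplyr)

# Hypothetical toy data.
toy <- tibble(id = 1:4, value = c(10, 25, 31, 8))

# 1. Intermediate variable: every step gets a name.
big <- filter(toy, value > 20)
n1 <- nrow(big)

# 2. Nested expression: read from the inside out.
n2 <- nrow(filter(toy, value > 20))

# 3. Pipe: read from top to bottom.
n3 <- toy %>%
  filter(value > 20) %>%
  nrow()

c(n1, n2, n3)
```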
Extend the four solutions to view all "UA" flights that arrived after 10:00 PM.
flights_after_10 <- filter(flights, ___)
ua_flights_after_10 <- ...
View(___)
View(filter(filter(flights, ___), ___))
flights %>%
filter(___) %>%
filter(___) %>%
View()
Extend the four solutions to view all "UA" flights that departed before 6:00 AM and arrived after 10:00 PM.
Extend the four solutions to view all "UA" flights that departed before 6:00 AM and arrived after 10:00 PM and had a delay of more than two hours.
Extend the four solutions to view all "UA" flights that departed before 6:00 AM and arrived after 10:00 PM and had a delay of more than two hours, originating in one of New York City’s airports.
Extend the four solutions to view all "UA" flights that departed before 6:00 AM and arrived after 10:00 PM and had a delay of more than two hours, originating in one of New York City’s airports but excluding Honolulu International airport.
Hint: Consult the airports dataset, use a filter with the predicate stringr::str_detect(name, "^Honolulu").
Sort the result by distance.
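The hint combines a lookup in one table with a filter on another. A sketch of that pattern on made-up data follows; places, trips and all values are hypothetical, and the stringr package is assumed to be installed.

```r
library(dplyr)

# Hypothetical lookup table and fact table.
places <- tibble(
  code = c("AAA", "BBB", "CCC"),
  name = c("Springfield Regional", "North Springfield", "Springfield Intl")
)
trips <- tibble(
  id   = 1:5,
  dest = c("AAA", "BBB", "CCC", "BBB", "AAA")
)

# "^" anchors the pattern at the start of the string,
# so "North Springfield" does not match.
excluded <- places %>%
  filter(stringr::str_detect(name, "^Springfield")) %>%
  pull(code)

# Keep only the trips whose destination is not in the excluded set.
kept <- trips %>%
  filter(!dest %in% excluded)

kept$id
```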
Plot a histogram of the air time of all flights. Exclude Honolulu International Airport in Hawaii to get rid of the peak at the right-hand side. Zoom into the flights that have an air time between 400 and 500 minutes.
Hint: Start with flights %>% ggplot() + ...
flights %>%
ggplot(___) +
___()
flights %>%
filter(___) %>%
ggplot(___) +
___()
flights %>%
filter(___) %>%
filter(___) %>%
___
Plot a heat map for all relations with an air time shorter than one hour.
Hint: Use geom_bin2d().
flights %>%
filter(___) %>%
ggplot(___) +
___()
Think of other plots of the flights data that would not work if applied on the full dataset but are useful when applying a filter beforehand.
Look at the “Details” section in the help page for | with help("|") to understand predicate logic in R. (We need element-wise comparisons.)
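To make "element-wise" concrete, here is a short base R sketch on two hypothetical logical vectors. It also checks De Morgan's law, which is useful when rewriting negated conditions.

```r
# Two hypothetical logical vectors covering all four combinations.
a <- c(TRUE, TRUE, FALSE, FALSE)
b <- c(TRUE, FALSE, TRUE, FALSE)

# | and & compare position by position and return a vector
# of the same length as the inputs.
either <- a | b
both   <- a & b

# De Morgan: "not (a or b)" equals "(not a) and (not b)".
de_morgan_holds <- identical(!(a | b), !a & !b)

either
both
de_morgan_holds
```

The doubled forms || and && are meant for single values, for example inside if (), and are not suitable inside filter().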
Find all flights that departed today x years ago, flown by "US". Two simple solutions exist; which appeals most to you?
flights %>%
filter(___, ___) %>%
filter(___)
flights %>%
filter(_____)
Find all flights that departed before 6:00 AM or after 10:00 PM.
flights %>%
filter(___ | ___)
Find all flights not flown by either "UA" or "WN". Can you think of three different solutions? Which appeals more to you?
flights %>%
filter(___ ___ ___)
flights %>%
filter(!(___) ___ !(___))
flights %>%
filter(!(_____))
Which flights have a missing departure or arrival time? Which have both missing? Can the number of flights that have a missing arrival but not departure time correspond to lost or crashed flights?
flights %>%
filter(is.na(___))
flights %>%
filter(___(___) ___ _____)
flights %>%
filter(_____ ___ !_____)
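Missing values need is.na() because a comparison with NA yields NA, and filter() drops rows whose condition is NA. A sketch on a hypothetical toy table:

```r
library(dplyr)

# Hypothetical toy data with missing values in both columns.
toy <- tibble(
  id    = 1:4,
  start = c(1, NA, 3, NA),
  end   = c(2, 5, NA, NA)
)

# Comparing with NA gives NA for every row, so nothing is kept.
wrong <- toy %>% filter(start == NA)

# is.na() returns TRUE or FALSE, never NA.
missing_start <- toy %>% filter(is.na(start))
missing_both  <- toy %>% filter(is.na(start) & is.na(end))

nrow(wrong)
missing_start$id
missing_both$id
```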
Find more exercises in items 1 and 4 of Section 5.2.4, and in Section 5.3.1, of r4ds.
Find three ways to select the first five variables from the flights dataset.
flights %>%
select(___, ___, ________)
flights %>%
select(___:___)
flights %>%
select(___:___)
Find three ways to exclude the date of the flight.
flights %>%
select(___, ___, ______________________)
flights %>%
select(-___, -___, -___)
flights %>%
select(-___:-___)
Select all variables related to departure.
flights %>%
select(___, ___, _______)
flights %>%
select(starts_with("___"))
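The different selection styles used in the select() exercises can be tried out on a tiny hypothetical table first. All five calls below are expected to return the same two columns.

```r
library(dplyr)

# Hypothetical one-row table with four columns.
toy <- tibble(a_x = 1, a_y = 2, b_x = 3, b_y = 4)

by_name   <- toy %>% select(a_x, a_y)            # list the columns
by_range  <- toy %>% select(a_x:a_y)             # range of names
by_pos    <- toy %>% select(1:2)                 # range of positions
by_prefix <- toy %>% select(starts_with("a_"))   # helper function
by_drop   <- toy %>% select(-b_x, -b_y)          # exclude the rest

names(by_name)
```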
Move the variables related to scheduled time to the end of the table.
flights %>%
select(-___, -___, _______, everything(), ___, ___)
Create a contour plot of departure and arrival time. Rename the columns to show prettily in the plot. Restrict the plot to all flights that arrive before 5:00 AM.
flights %>%
select(`___` = ___, `___` = ___, ___) %>%
filter(___) %>%
ggplot(aes(x = `___`, y = `___`)) +
geom_density2d()
Find more exercises in Section 5.4.1 of r4ds.
Store the speed for each flight as miles per hour in a new variable.
flights %>%
mutate(miles_per_hour = distance ___ air_time ___ ___) %>%
ggplot(aes(___)) +
_____
Can you use an intermediate variable to clarify the intent? How do you remove the intermediate variable?
flights %>%
mutate(miles_per_minute = _____) %>%
mutate(miles_per_hour = _____) %>%
select(_____)
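The same pattern of computing via a helper column and then dropping it can be sketched with different units on hypothetical data (metres, seconds, m_per_s and km_per_h are illustrative names, not columns of flights).

```r
library(dplyr)

# Hypothetical toy data.
toy <- tibble(
  id      = 1:3,
  metres  = c(1000, 3000, 500),
  seconds = c(100, 600, 250)
)

res <- toy %>%
  mutate(m_per_s = metres / seconds) %>%   # intermediate step
  mutate(km_per_h = m_per_s * 3.6) %>%     # final quantity
  select(-m_per_s)                         # drop the helper column

res
```

A negative selection removes the helper while leaving all other columns in place.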
Visualize the speed distribution as a histogram. Would this visualization work without involving mutate()?
flights %>%
______ %>%
ggplot(aes(___)) +
_____
# Alternative:
flights %>%
ggplot(aes(___)) +
_____
Create a new logical variable that indicates if the flight arrived on time.
flights %>%
mutate(on_time = (___ <= 0))
Visualize the aggregated on-time status per airline with a useful text.
flights %>%
mutate(
on_time = _____,
on_time_desc = if_else(___, "On time", ___)
) %>%
ggplot(aes(___)) +
geom_bar()
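A sketch of how if_else() turns a logical column into a readable label, using hypothetical data and labels. Note that a missing condition yields a missing label.

```r
library(dplyr)

# Hypothetical toy data with one missing value.
toy <- tibble(id = 1:4, delay = c(-5, 0, 12, NA))

labelled <- toy %>%
  mutate(
    ok    = delay <= 0,
    # Both branches must have the same type (here: character).
    label = if_else(ok, "yes", "no")
  )

labelled$label
```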
Can you detect a difference in the speed distributions of on-time vs. delayed flights? Use color or faceting.
speed_and_on_time_info <-
_____
speed_and_on_time_info %>%
ggplot() +
geom_freqpoly(
aes(x = ___, y = after_stat(density), color = ___),
na.rm = TRUE,
binwidth = 20
)
speed_and_on_time_info %>%
ggplot() +
geom_histogram(
aes(x = ___),
na.rm = TRUE,
binwidth = 20
) +
facet_wrap(~___, ncol = 1)
Create two new variables date_hour and date_ymd, using as.Date() and lubridate::make_date(), respectively. Are the two values the same for all observations? What happens if we omit the tz argument to as.Date()?
flights %>%
mutate(
___ = as.Date(___, tz = "EST"),
___ = lubridate::make_date(_____)
) %>%
filter(___)
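To see why the time zone matters when converting a date-time to a date, consider one hypothetical instant late in the evening. The sketch passes tz explicitly in both calls and assumes the lubridate package and the "America/New_York" time zone are available.

```r
# One hypothetical instant: 11:30 PM on 1 January 2013, New York local time.
t_local <- as.POSIXct("2013-01-01 23:30:00", tz = "America/New_York")

# The calendar date depends on the time zone used for the conversion.
d_ny  <- as.Date(t_local, tz = "America/New_York")  # local calendar date
d_utc <- as.Date(t_local, tz = "UTC")               # already the next day in UTC

# make_date() assembles a Date from year, month, and day components.
d_parts <- lubridate::make_date(year = 2013L, month = 1L, day = 1L)

d_ny == d_parts
d_utc == d_parts
```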
Visualize the deviation from the overall average departure delay for the three airports of New York City. Consider using a violin plot.
flights %>%
mutate(dep_delay_dev = ___ - mean(___)) %>%
ggplot(aes(___)) +
_____ +
_____
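Inside mutate(), a summary function such as mean() is computed over the whole column and recycled to every row. The sketch below, on hypothetical data, also shows why missing values may need attention: without na.rm = TRUE the mean itself is NA.

```r
library(dplyr)

# Hypothetical toy data with one missing value.
toy <- tibble(
  grp   = c("a", "a", "b", "b"),
  value = c(1, NA, 5, 6)
)

res <- toy %>%
  mutate(
    dev_plain = value - mean(value),               # NA everywhere
    dev       = value - mean(value, na.rm = TRUE)  # deviation from 4
  )

res
```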
Find more exercises in Section 5.5.2 of r4ds.
Copyright © 2018 Kirill Müller. Licensed under CC BY-NC 4.0.