Filtering

  1. View the flights dataset in RStudio’s data pane. Look up the meaning of the variables in the help.

    Hint: You need to load the nycflights13 package.

  2. Find all flights that departed on today’s date in 2013, the year covered by the dataset.

  3. Find all flights that departed between 8:00 AM and 10:00 PM.

  4. Find all flights that departed in the three winter months.

  5. Are there any flights where departure time is later than arrival time? What does this mean?

  6. Find more exercises in item 1 of Section 5.2.4 of r4ds.

Assignment, the pipe

  1. View all flights that arrived after 10:00 PM. Use a combined filter, an intermediate variable, a nested expression, and the pipe. Which appeals most to you?

  2. Extend the four solutions to view all "UA" flights that arrived after 10:00 PM.

  3. Extend the four solutions to view all "UA" flights that departed before 6:00 AM and arrived after 10:00 PM.

  4. Extend the four solutions to view all "UA" flights that departed before 6:00 AM and arrived after 10:00 PM and had a delay of more than two hours.

  5. Extend the four solutions to view all "UA" flights that departed before 6:00 AM and arrived after 10:00 PM and had a delay of more than two hours, originating in one of New York City’s airports.

  6. Extend the four solutions to view all "UA" flights that departed before 6:00 AM and arrived after 10:00 PM and had a delay of more than two hours, originating in one of New York City’s airports but excluding flights to Honolulu International Airport.

    Hint: Consult the airports dataset.

  7. Apply more restrictions to the four solutions.
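
The four styles named in exercise 1 can be sketched on a different question (January flights to "LAX") so that the exercises stay open. This is only an illustration; reading "combined filter" as a single filter() call with several conditions is an assumption.

```r
library(dplyr)
library(nycflights13)

# 1. Combined filter: all conditions in a single filter() call
combined <- filter(flights, month == 1, dest == "LAX")

# 2. Intermediate variable: store the first result, then filter it again
january <- filter(flights, month == 1)
intermediate <- filter(january, dest == "LAX")

# 3. Nested expression: the inner call is evaluated first
nested <- filter(filter(flights, month == 1), dest == "LAX")

# 4. Pipe: reads top to bottom, one step per line
piped <- flights %>%
  filter(month == 1) %>%
  filter(dest == "LAX")

# All four approaches should select the same rows
c(nrow(combined), nrow(intermediate), nrow(nested), nrow(piped))
```

Appending `%>% View()` to the piped version, or wrapping any of the other results in View(), opens the result in RStudio’s data pane.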

Logical operators, NA

  1. Find all flights flown by "US" that departed on today’s date in 2013. Two simple solutions exist; which appeals more to you?

  2. Find all flights that departed before 6:00 AM or after 10:00 PM.

  3. Find all flights not flown by either "UA" or "WN". Can you think of three different solutions? Which appeals most to you?

  4. Which flights have a missing departure or arrival time? Which have both missing? Could the flights with a missing arrival time but a recorded departure time correspond to lost or crashed flights?

  5. Find more exercises in item 4 of Section 5.2.4 of r4ds.
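
How filter() treats missing values is easy to overlook, so here is a minimal sketch on a made-up four-row table. The table `toy` and its values are hypothetical and not taken from the flights data.

```r
library(dplyr)

# Hypothetical toy data; dep_time uses the same HHMM convention as flights
toy <- tibble(
  carrier  = c("UA", "WN", "US", "AA"),
  dep_time = c(530L, NA, 2230L, 1200L)
)

# filter() keeps only rows where the condition is TRUE;
# rows where the condition evaluates to NA are dropped silently
early_or_late <- toy %>%
  filter(dep_time < 600 | dep_time > 2200)

# Missing values have to be requested explicitly with is.na()
missing_dep <- toy %>%
  filter(is.na(dep_time))

# %in% tests membership in a set; ! negates the result
other_carriers <- toy %>%
  filter(!carrier %in% c("UA", "WN"))
```

Here `early_or_late` contains the "UA" and "US" rows, `missing_dep` contains only the "WN" row, and `other_carriers` contains the "US" and "AA" rows.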

Filtering and plotting

  1. Plot a histogram of the air time of all flights. Exclude flights to Honolulu International Airport in Hawaii to get rid of the peak at the right-hand side. Zoom into the flights that have an air time between 400 and 500 minutes.

    Hint: Start with flights %>% ggplot() + ...

  2. Plot a heat map for all routes (origin-destination pairs) with an air time shorter than one hour.

    Hint: Use geom_bin2d().

  3. Think of other plots of the flights data that would not work if applied to the full dataset but are useful when applying a filter beforehand.

Arrange

  1. On what day did the flight with the shortest airtime take place?

    Hint: Use head() to restrict your result to one row only.

  2. Which flights had the longest delays? Can you use the tail() function to obtain this information?

  3. On what day did the flight with the longest airtime take place?

  4. Find two equivalent ways to select the six "UA" flights with the shortest delay. Which is faster? Why?

    Hint: RStudio has shortcuts for swapping the current line with the next or previous line.

  5. Which flights were best at making up for a departure delay while in the air?

  6. Find more exercises in Section 5.3.1 of r4ds.
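
The interplay of arrange(), desc(), head(), and tail() can be sketched on a made-up table. The table `toy` is hypothetical. The point to notice is that arrange() always sorts missing values to the end, even inside desc().

```r
library(dplyr)

# Hypothetical toy data with one missing air time
toy <- tibble(
  flight   = c("A", "B", "C", "D"),
  air_time = c(120, NA, 35, 300)
)

# Smallest value first; head(1) keeps the first row only
shortest <- toy %>%
  arrange(air_time) %>%
  head(1)

# desc() reverses the order of the non-missing values
longest <- toy %>%
  arrange(desc(air_time)) %>%
  head(1)

# The last row after an ascending sort is the NA row, not the maximum
last_row <- toy %>%
  arrange(air_time) %>%
  tail(1)
```

In this sketch `shortest` is flight "C", `longest` is flight "D", and `last_row` is flight "B" with its missing air time, which is worth keeping in mind when answering the question about tail().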

Select and rename

  1. Find three ways to select the first five variables from the flights dataset.

  2. Find three ways to exclude the date of the flight.

  3. Select all variables related to departure.

  4. Move the variables related to scheduled time to the end of the table.

  5. Create a contour plot of departure and arrival time. Use two different techniques to set pretty axis labels. Then, restrict the plot to all flights that arrive before 5:00 AM. How do you fix the aspect ratio of the plot?

    Hint: Use geom_density2d().

  6. Find more exercises in Section 5.4.1 of r4ds.

Mutate

  1. Store the speed for each flight as miles per hour in a new variable. Visualize the speed distribution as a histogram. Would this visualization work without involving mutate()?

  2. Can you detect a difference in the speed distributions of on-time vs. delayed flights? Create a new variable that displays nicely in the legend or in the facet.

  3. Visualize the deviation from the average departure delay for the three airports of New York City. Consider using a violin plot.

  4. Find more exercises in Section 5.5.2 of r4ds.

Summarize

  1. Compute the mean arrival and departure delay overall, and per origin airport. What is the standard deviation of these variables? What is New York City’s busiest airport?

  2. Which carriers had the longest accumulated air time? Plot a bar chart with a suitable unit for the total time.

    Hint: Use forcats::fct_inorder() to fix the ordering of a categorical variable before plotting.

  3. Which carriers specialize in long-distance routes? Plot a bar chart similar to the previous exercise.

  4. Which plane had the most failed departure attempts?

  5. Compute the share of short-distance routes (less than 300 miles) for each airline.

  6. Find more exercises in item 1 of Section 5.6.7 of r4ds.

Summarize with multiple variables

  1. Which destination airport is serviced by the largest number of distinct airlines? Find a solution using summarize(), and one using count(). Which is more elegant?

  2. Create a heat map for the share of cancelled flights per month per airline.

    Hint: Use geom_raster().

  3. Find more exercises in Section 5.6.7 of r4ds.

Grouped mutate

  1. Which month is busiest in terms of miles flown? Draw a heat map to see if this pattern holds across all airlines.

    Hint: Compute the share of yearly miles flown of each airline in each month.

  2. Compute the ground time for each airplane on any given day. Make sure that you add only positive numbers. Visualize the distribution of ground times by airline.

    Hint: Use lag() or lead().

  3. Find more exercises in Section 5.7.1 of r4ds.
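
A grouped mutate() computes its result within each group but keeps every row. The sketch below shows the two patterns mentioned in the hints, a per-group share and lag(), on made-up tables. The names `miles_toy` and `legs_toy` and all values are hypothetical.

```r
library(dplyr)

# Hypothetical monthly miles for two carriers
miles_toy <- tibble(
  carrier = c("X", "X", "Y", "Y"),
  month   = c(1, 2, 1, 2),
  miles   = c(100, 300, 50, 50)
)

# sum(miles) is evaluated per carrier, so the shares add up to 1 per carrier
shares <- miles_toy %>%
  group_by(carrier) %>%
  mutate(share = miles / sum(miles)) %>%
  ungroup()

# Hypothetical legs flown by two planes on one day (times in HHMM format)
legs_toy <- tibble(
  tailnum  = c("N1", "N1", "N1", "N2", "N2"),
  dep_time = c(600, 900, 1300, 700, 1100),
  arr_time = c(800, 1200, 1500, 1000, 1400)
)

# lag() looks one row back within each plane; the first leg has no predecessor
with_prev <- legs_toy %>%
  group_by(tailnum) %>%
  arrange(dep_time, .by_group = TRUE) %>%
  mutate(prev_arr = lag(arr_time)) %>%
  ungroup()
```

The column `prev_arr` is NA for the first leg of each plane. Because the times are in HHMM format, subtracting `prev_arr` from `dep_time` does not yield minutes directly; that conversion is left as part of the exercise.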

Copyright © 2017 Kirill Müller. Licensed under CC BY-NC 4.0.