Flights on the same date as today

Find all flights that departed today x years ago, flown by "US". Two simple solutions exist, which appeals most to you?

flights %>%
  filter(___, ___) %>%
  filter(___)
  
flights %>%
  filter(_____)

► Solution:

flights %>%
  filter(day == 13, month == 11) %>%
  filter(carrier == "US")
## # A tibble: 66 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    11    13      458            500        -2      632            651
##  2  2013    11    13      551            600        -9      638            658
##  3  2013    11    13      553            600        -7      650            659
##  4  2013    11    13      606            615        -9      741            818
##  5  2013    11    13      607            615        -8      758            817
##  6  2013    11    13      622            630        -8      818            833
##  7  2013    11    13      625            630        -5      824            840
##  8  2013    11    13      626            630        -4      915           1018
##  9  2013    11    13      631            600        31      739            701
## 10  2013    11    13      653            700        -7      753            809
## # … with 56 more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
flights %>%
  filter(day == 13 & month == 11 & carrier == "US")
## # A tibble: 66 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    11    13      458            500        -2      632            651
##  2  2013    11    13      551            600        -9      638            658
##  3  2013    11    13      553            600        -7      650            659
##  4  2013    11    13      606            615        -9      741            818
##  5  2013    11    13      607            615        -8      758            817
##  6  2013    11    13      622            630        -8      818            833
##  7  2013    11    13      625            630        -5      824            840
##  8  2013    11    13      626            630        -4      915           1018
##  9  2013    11    13      631            600        31      739            701
## 10  2013    11    13      653            700        -7      753            809
## # … with 56 more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Red eye flights

Find all flights that departed before 6:00 AM or after 10:00 PM.

flights %>%
  filter(___ | ___)

► Solution:

flights %>%
  filter(dep_time < 600 | dep_time > 2200)
## # A tibble: 16,753 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 16,743 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Flights not by UA or WN

Find all flights not flown by either "UA" or "WN". Can you think of three different solutions? Which appeals more to you?

flights %>%
  filter(___ ___ ___)

flights %>%
  filter(!(___) ___ !(___))

flights %>%
  filter(!(_____))

► Solution:

flights %>%
  filter(carrier != "UA" & carrier != "WN")
## # A tibble: 265,836 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      542            540         2      923            850
##  2  2013     1     1      544            545        -1     1004           1022
##  3  2013     1     1      554            600        -6      812            837
##  4  2013     1     1      555            600        -5      913            854
##  5  2013     1     1      557            600        -3      709            723
##  6  2013     1     1      557            600        -3      838            846
##  7  2013     1     1      558            600        -2      753            745
##  8  2013     1     1      558            600        -2      849            851
##  9  2013     1     1      558            600        -2      853            856
## 10  2013     1     1      559            600        -1      941            910
## # … with 265,826 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
flights %>%
  filter(!(carrier == "UA" | carrier == "WN"))
## # A tibble: 265,836 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      542            540         2      923            850
##  2  2013     1     1      544            545        -1     1004           1022
##  3  2013     1     1      554            600        -6      812            837
##  4  2013     1     1      555            600        -5      913            854
##  5  2013     1     1      557            600        -3      709            723
##  6  2013     1     1      557            600        -3      838            846
##  7  2013     1     1      558            600        -2      753            745
##  8  2013     1     1      558            600        -2      849            851
##  9  2013     1     1      558            600        -2      853            856
## 10  2013     1     1      559            600        -1      941            910
## # … with 265,826 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
flights %>%
  filter(!(carrier %in% c("UA", "WN")))
## # A tibble: 265,836 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      542            540         2      923            850
##  2  2013     1     1      544            545        -1     1004           1022
##  3  2013     1     1      554            600        -6      812            837
##  4  2013     1     1      555            600        -5      913            854
##  5  2013     1     1      557            600        -3      709            723
##  6  2013     1     1      557            600        -3      838            846
##  7  2013     1     1      558            600        -2      753            745
##  8  2013     1     1      558            600        -2      849            851
##  9  2013     1     1      558            600        -2      853            856
## 10  2013     1     1      559            600        -1      941            910
## # … with 265,826 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Missing data

Which flights have a missing departure or arrival time? Which have both missing? Can the number of flights that have a missing arrival but not departure time correspond to lost or crashed flights?

flights %>%
  filter(is.na(___))

flights %>%
  filter(___(___) ___ _____)

flights %>%
  filter(_____ ___ !_____)

► Solution:

flights %>%
  filter(is.na(arr_time))
## # A tibble: 8,713 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1     2016           1930        46       NA           2220
##  2  2013     1     1       NA           1630        NA       NA           1815
##  3  2013     1     1       NA           1935        NA       NA           2240
##  4  2013     1     1       NA           1500        NA       NA           1825
##  5  2013     1     1       NA            600        NA       NA            901
##  6  2013     1     2     2041           2045        -4       NA           2359
##  7  2013     1     2     2145           2129        16       NA             33
##  8  2013     1     2       NA           1540        NA       NA           1747
##  9  2013     1     2       NA           1620        NA       NA           1746
## 10  2013     1     2       NA           1355        NA       NA           1459
## # … with 8,703 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
flights %>%
  filter(is.na(arr_time) & is.na(dep_time))
## # A tibble: 8,255 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1       NA           1630        NA       NA           1815
##  2  2013     1     1       NA           1935        NA       NA           2240
##  3  2013     1     1       NA           1500        NA       NA           1825
##  4  2013     1     1       NA            600        NA       NA            901
##  5  2013     1     2       NA           1540        NA       NA           1747
##  6  2013     1     2       NA           1620        NA       NA           1746
##  7  2013     1     2       NA           1355        NA       NA           1459
##  8  2013     1     2       NA           1420        NA       NA           1644
##  9  2013     1     2       NA           1321        NA       NA           1536
## 10  2013     1     2       NA           1545        NA       NA           1910
## # … with 8,245 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
flights %>%
  filter(is.na(arr_time) & !is.na(dep_time))
## # A tibble: 458 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1     2016           1930        46       NA           2220
##  2  2013     1     2     2041           2045        -4       NA           2359
##  3  2013     1     2     2145           2129        16       NA             33
##  4  2013     1     9      615            615         0       NA            855
##  5  2013     1     9     2042           2040         2       NA           2357
##  6  2013     1    11     1344           1350        -6       NA           1518
##  7  2013     1    13     1907           1634       153       NA           1837
##  8  2013     1    13     2239           2159        40       NA             30
##  9  2013     1    16      837            840        -3       NA           1030
## 10  2013     1    25     1452           1500        -8       NA           1619
## # … with 448 more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Commutativity

Find two equivalent ways to select the six "UA" flights with the lowest delay. Which is faster? Why?

Hint: RStudio has shortcuts for swapping the current line with the next or previous line.

flights %>%
  filter(___) %>%
  arrange(___)

flights %>%
  arrange(___) %>%
  filter(___)

► Solution: Arranging a smaller subset needs less work.

system.time(
  flights %>%
    arrange(arr_delay) %>% 
    filter(carrier == "UA")
)
##    user  system elapsed 
##   0.217   0.000   0.216
system.time(
  flights %>%
    filter(carrier == "UA") %>%
    arrange(arr_delay)
)
##    user  system elapsed 
##   0.042   0.000   0.042

Understand predicate logic

Look at the “Details” section in the help page for | with help("|") to understand predicate logic in R. (We need element-wise comparisons.)

More exercises

Find more exercises in items 1 and 4 of Section 5.2.4, and in Section 5.3.1, of r4ds.

Copyright © 2019 Kirill Müller. Licensed under CC BY-NC 4.0.