Flights on the same date as today

Find all flights that departed today x years ago, flown by "US". Two simple solutions exist, which appeals most to you?

flights %>%
  filter(___, ___) %>%
  filter(___)
  
flights %>%
  filter(_____)

► Solution:

flights %>%
  filter(day == 13, month == 11) %>%
  filter(carrier == "US")
## # A tibble: 66 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013    11    13      458            500        -2      632
##  2  2013    11    13      551            600        -9      638
##  3  2013    11    13      553            600        -7      650
##  4  2013    11    13      606            615        -9      741
##  5  2013    11    13      607            615        -8      758
##  6  2013    11    13      622            630        -8      818
##  7  2013    11    13      625            630        -5      824
##  8  2013    11    13      626            630        -4      915
##  9  2013    11    13      631            600        31      739
## 10  2013    11    13      653            700        -7      753
## # ... with 56 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>
flights %>%
  filter(day == 13 & month == 11 & carrier == "US")
## # A tibble: 66 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013    11    13      458            500        -2      632
##  2  2013    11    13      551            600        -9      638
##  3  2013    11    13      553            600        -7      650
##  4  2013    11    13      606            615        -9      741
##  5  2013    11    13      607            615        -8      758
##  6  2013    11    13      622            630        -8      818
##  7  2013    11    13      625            630        -5      824
##  8  2013    11    13      626            630        -4      915
##  9  2013    11    13      631            600        31      739
## 10  2013    11    13      653            700        -7      753
## # ... with 56 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

Red eye flights

Find all flights that departed before 6:00 AM or after 10:00 PM.

flights %>%
  filter(___ | ___)

► Solution:

flights %>%
  filter(dep_time < 600 | dep_time > 2200)
## # A tibble: 16,753 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 16,743 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

Flights not by UA or WN

Find all flights not flown by either "UA" or "WN". Can you think of three different solutions? Which appeals more to you?

flights %>%
  filter(___ ___ ___)

flights %>%
  filter(!(___) ___ !(___))

flights %>%
  filter(!(_____))

► Solution:

flights %>%
  filter(carrier != "UA" & carrier != "WN")
## # A tibble: 265,836 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      542            540         2      923
##  2  2013     1     1      544            545        -1     1004
##  3  2013     1     1      554            600        -6      812
##  4  2013     1     1      555            600        -5      913
##  5  2013     1     1      557            600        -3      709
##  6  2013     1     1      557            600        -3      838
##  7  2013     1     1      558            600        -2      753
##  8  2013     1     1      558            600        -2      849
##  9  2013     1     1      558            600        -2      853
## 10  2013     1     1      559            600        -1      941
## # ... with 265,826 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>
flights %>%
  filter(!(carrier == "UA" | carrier == "WN"))
## # A tibble: 265,836 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      542            540         2      923
##  2  2013     1     1      544            545        -1     1004
##  3  2013     1     1      554            600        -6      812
##  4  2013     1     1      555            600        -5      913
##  5  2013     1     1      557            600        -3      709
##  6  2013     1     1      557            600        -3      838
##  7  2013     1     1      558            600        -2      753
##  8  2013     1     1      558            600        -2      849
##  9  2013     1     1      558            600        -2      853
## 10  2013     1     1      559            600        -1      941
## # ... with 265,826 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>
flights %>%
  filter(!(carrier %in% c("UA", "WN")))
## # A tibble: 265,836 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      542            540         2      923
##  2  2013     1     1      544            545        -1     1004
##  3  2013     1     1      554            600        -6      812
##  4  2013     1     1      555            600        -5      913
##  5  2013     1     1      557            600        -3      709
##  6  2013     1     1      557            600        -3      838
##  7  2013     1     1      558            600        -2      753
##  8  2013     1     1      558            600        -2      849
##  9  2013     1     1      558            600        -2      853
## 10  2013     1     1      559            600        -1      941
## # ... with 265,826 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

Missing data

Which flights have a missing departure or arrival time? Which have both missing? Can the number of flights that have a missing arrival but not departure time correspond to lost or crashed flights?

flights %>%
  filter(is.na(___))

flights %>%
  filter(___(___) ___ _____)

flights %>%
  filter(_____ ___ !_____)

► Solution:

flights %>%
  filter(is.na(arr_time))
## # A tibble: 8,713 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1     2016           1930        46       NA
##  2  2013     1     1       NA           1630        NA       NA
##  3  2013     1     1       NA           1935        NA       NA
##  4  2013     1     1       NA           1500        NA       NA
##  5  2013     1     1       NA            600        NA       NA
##  6  2013     1     2     2041           2045        -4       NA
##  7  2013     1     2     2145           2129        16       NA
##  8  2013     1     2       NA           1540        NA       NA
##  9  2013     1     2       NA           1620        NA       NA
## 10  2013     1     2       NA           1355        NA       NA
## # ... with 8,703 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>
flights %>%
  filter(is.na(arr_time) & is.na(dep_time))
## # A tibble: 8,255 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1       NA           1630        NA       NA
##  2  2013     1     1       NA           1935        NA       NA
##  3  2013     1     1       NA           1500        NA       NA
##  4  2013     1     1       NA            600        NA       NA
##  5  2013     1     2       NA           1540        NA       NA
##  6  2013     1     2       NA           1620        NA       NA
##  7  2013     1     2       NA           1355        NA       NA
##  8  2013     1     2       NA           1420        NA       NA
##  9  2013     1     2       NA           1321        NA       NA
## 10  2013     1     2       NA           1545        NA       NA
## # ... with 8,245 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>
flights %>%
  filter(is.na(arr_time) & !is.na(dep_time))
## # A tibble: 458 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1     2016           1930        46       NA
##  2  2013     1     2     2041           2045        -4       NA
##  3  2013     1     2     2145           2129        16       NA
##  4  2013     1     9      615            615         0       NA
##  5  2013     1     9     2042           2040         2       NA
##  6  2013     1    11     1344           1350        -6       NA
##  7  2013     1    13     1907           1634       153       NA
##  8  2013     1    13     2239           2159        40       NA
##  9  2013     1    16      837            840        -3       NA
## 10  2013     1    25     1452           1500        -8       NA
## # ... with 448 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

Commutativity

Find two equivalent ways to select the six "UA" flights with the lowest delay. Which is faster? Why?

Hint: RStudio has shortcuts for swapping the current line with the next or previous line.

flights %>%
  filter(___) %>%
  arrange(___)

flights %>%
  arrange(___) %>%
  filter(___)

► Solution: Arranging a smaller subset needs less work.

system.time(
  flights %>%
    arrange(arr_delay) %>% 
    filter(carrier == "UA")
)
##    user  system elapsed 
##   0.216   0.000   0.216
system.time(
  flights %>%
    filter(carrier == "UA") %>%
    arrange(arr_delay)
)
##    user  system elapsed 
##   0.040   0.000   0.042

Understand predicate logic

Look at the “Details” section in the help page for | with help("|") to understand predicate logic in R. (We need element-wise comparisons.)

More exercises

Find more exercises in items 1 and 4 of Section 5.2.4, and in Section 5.3.1, of r4ds.

Copyright © 2018 Kirill Müller. Licensed under CC BY-NC 4.0.