Flight with shortest airtime

On what day did the flight with the shortest airtime take place?

Hint: Use head() to restrict your result to one row only.

flights %>% 
  arrange(___) %>%
  head(1)

► Solution:

flights %>% 
  arrange(air_time) %>%
  head(1)
## # A tibble: 1 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     1    16     1355           1315        40     1442           1411
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

Flight with heaviest delay

Which flight had the heaviest arrival delay? Can you use tail() to obtain this information?

flights %>% 
  arrange(___) %>%
  tail(1)

flights %>% 
  arrange(desc(___)) %>%
  ___(1)

► Solution:

flights %>% 
  arrange(arr_delay) %>%
  tail(1)
## # A tibble: 1 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     9    30       NA            840        NA       NA           1020
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

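Why doesn’t this give the result we’re looking for? arrange() always sorts missing values to the end, so tail(1) returns a flight with no recorded arrival delay. Can we use a filter?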

flights %>% 
  filter(!is.na(arr_delay)) %>%
  arrange(arr_delay) %>%
  tail(1)
## # A tibble: 1 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     1     9      641            900      1301     1242           1530
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

Or the pattern below?

flights %>% 
  arrange(!is.na(arr_delay), arr_delay) %>%
  tail(1)
## # A tibble: 1 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     1     9      641            900      1301     1242           1530
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
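
The following is a minimal sketch of why the filter and the !is.na() pattern both work. It uses a small, hypothetical tibble (toy, with made-up columns id and delay) rather than flights, so the row order is easy to check by eye.

```r
library(dplyr)

# Hypothetical toy data, not taken from flights
toy <- tibble(id = c("a", "b", "c", "d"), delay = c(5, NA, 20, -3))

# Missing values end up last in ascending order ...
asc <- toy %>% arrange(delay)
asc

# ... and they also end up last in descending order
dsc <- toy %>% arrange(desc(delay))
dsc

# Sorting by !is.na(delay) first moves the NA rows to the top,
# because FALSE sorts before TRUE; tail(1) then sees the largest value
last_row <- toy %>%
  arrange(!is.na(delay), delay) %>%
  tail(1)
last_row
```

Because the rows with a missing delay are moved to the front, the last row is the one with the largest non-missing value.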

Usually it’s easiest to sort in descending order:

flights %>% 
  arrange(-arr_delay) %>%
  head(1)
## # A tibble: 1 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     1     9      641            900      1301     1242           1530
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

Or, equivalently, with desc():

flights %>% 
  arrange(desc(arr_delay)) %>%
  head(1)
## # A tibble: 1 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     1     9      641            900      1301     1242           1530
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
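
The two spellings above are interchangeable for numeric columns, but not for every column type. The sketch below illustrates the difference on a hypothetical toy tibble (the names toy, by_minus, by_desc and minus_fails are made up for this example).

```r
library(dplyr)

# Hypothetical toy data for illustration
toy <- tibble(carrier = c("UA", "AA", "DL"), delay = c(12, 45, -7))

# For numeric columns, both spellings give the same order
by_minus <- toy %>% arrange(-delay)
by_desc  <- toy %>% arrange(desc(delay))
identical(by_minus$carrier, by_desc$carrier)

# For character columns, desc() still works
by_carrier <- toy %>% arrange(desc(carrier))
by_carrier

# The unary minus is not defined for strings, so this raises an error,
# which is caught here and turned into TRUE
minus_fails <- tryCatch(
  {
    toy %>% arrange(-carrier)
    FALSE
  },
  error = function(e) TRUE
)
minus_fails
```

This suggests desc() as the safer default, with the minus sign as a shortcut for numeric columns only.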

Flight with longest airtime

On what day did the flight with the longest airtime take place?

flights %>% 
  arrange(desc(___)) %>%
  head(1)

► Solution:

flights %>% 
  arrange(desc(air_time)) %>%
  head(1)
## # A tibble: 1 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     3    17     1337           1335         2     1937           1836
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

UA flights with lowest delay

Find two equivalent ways to select the six "UA" flights with the lowest delay. Which is faster? Why?

Hint: RStudio has shortcuts for swapping the current line with the next or previous line.

flights %>%
  filter(___) %>%
  arrange(___)

flights %>%
  arrange(___) %>%
  filter(___)

► Solution: Filtering first is faster, because fewer observations need to be sorted. Piping either result into head() keeps only the first six rows.

flights %>% 
  filter(carrier == "UA") %>%
  arrange(arr_delay)
## # A tibble: 58,665 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     5     2     1947           1949        -2     2209           2324
##  2  2013     5     2     1926           1929        -3     2157           2310
##  3  2013     5     7     2054           2055        -1     2317             28
##  4  2013     2    26     1335           1335         0     1819           1929
##  5  2013     2    26     1721           1725        -4     1936           2046
##  6  2013     2    28      702            705        -3      924           1034
##  7  2013     5    13     1624           1629        -5     1831           1941
##  8  2013     5     4     1914           1915        -1     2107           2216
##  9  2013    12    27      853            856        -3     1052           1200
## 10  2013     3     1      629            632        -3      844            952
## # … with 58,655 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Arranging first returns the same rows:

flights %>% 
  arrange(arr_delay) %>%
  filter(carrier == "UA")
## # A tibble: 58,665 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     5     2     1947           1949        -2     2209           2324
##  2  2013     5     2     1926           1929        -3     2157           2310
##  3  2013     5     7     2054           2055        -1     2317             28
##  4  2013     2    26     1335           1335         0     1819           1929
##  5  2013     2    26     1721           1725        -4     1936           2046
##  6  2013     2    28      702            705        -3      924           1034
##  7  2013     5    13     1624           1629        -5     1831           1941
##  8  2013     5     4     1914           1915        -1     2107           2216
##  9  2013    12    27      853            856        -3     1052           1200
## 10  2013     3     1      629            632        -3      844            952
## # … with 58,655 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
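
One rough way to check the speed claim is to time both pipelines with system.time(). The sketch below assumes that flights comes from the nycflights13 package; the variable names are made up for this example, and the timings will vary between machines and runs, so none are shown here.

```r
library(dplyr)
library(nycflights13)  # assumption: this package provides the flights data

# Filter first: only the "UA" rows need to be sorted
time_filter_first <- system.time(
  filter_first <- flights %>%
    filter(carrier == "UA") %>%
    arrange(arr_delay) %>%
    head()  # head() keeps six rows by default
)

# Arrange first: all rows are sorted before most are discarded
time_arrange_first <- system.time(
  arrange_first <- flights %>%
    arrange(arr_delay) %>%
    filter(carrier == "UA") %>%
    head()
)

# Both pipelines should yield the same six arrival delays
identical(filter_first$arr_delay, arrange_first$arr_delay)

# Elapsed wall-clock time in seconds for each variant
time_filter_first["elapsed"]
time_arrange_first["elapsed"]
```

A single system.time() call is noisy on a data set of this size, so repeating the measurement a few times gives a more reliable picture.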

Recovering delay

Which flights were best at recovering from a delay while in the air?

► Solution: Recovering from delay means that the arrival delay is lower than the departure delay, or that arrival minus departure delay is negative. If we arrange by arrival minus departure delay, negative values are sorted first, so they are easier to inspect.

flights %>% 
  arrange(arr_delay - dep_delay)
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     6    13     1907           1512       235     2134           1928
##  2  2013     2    26     1000            900        60     1513           1540
##  3  2013     2    23     1226            900       206     1746           1540
##  4  2013     5    13     1917           1900        17     2149           2251
##  5  2013     2    27      924            900        24     1448           1540
##  6  2013     7    14     1917           1829        48     2109           2135
##  7  2013     7    17     2004           1930        34     2224           2304
##  8  2013    12    27     1719           1648        31     1956           2038
##  9  2013     5     2     1947           1949        -2     2209           2324
## 10  2013    11    13     2024           2015         9     2251           2354
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
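
In the printed output above, arr_delay is hidden among the truncated columns, so the recovered time cannot be read off directly. The sketch below is one possible way to make it visible. It assumes flights comes from nycflights13, uses mutate() and select() (which may not have been introduced at this point), and the column name recovered is made up for this example.

```r
library(dplyr)
library(nycflights13)  # assumption: this package provides the flights data

recovery <- flights %>%
  # minutes made up in the air; positive means the flight caught up
  mutate(recovered = dep_delay - arr_delay) %>%
  # keep only the columns needed to inspect the result
  select(year, month, day, carrier, flight, dep_delay, arr_delay, recovered) %>%
  # largest recovery first; rows with missing delays are sorted last
  arrange(desc(recovered))

recovery %>% head()
```

Sorting by desc(dep_delay - arr_delay) gives the same order of non-missing rows as sorting by arr_delay - dep_delay, apart from possible differences among ties.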

Copyright © 2019 Kirill Müller. Licensed under CC BY-NC 4.0.