Flight with shortest airtime

On what day did the flight with the shortest airtime take place?

Hint: Use head() to restrict your result to one row only.

flights %>% 
  arrange(___) %>%
  head(1)

► Solution:

flights %>% 
  arrange(air_time) %>%
  head(1)
## # A tibble: 1 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>
## 1  2013     1    16     1355           1315        40     1442
## # … with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## #   time_hour <dttm>

Flight with heaviest delay

Which flight had the heaviest arrival delay? Can you use tail() to obtain this information?

flights %>% 
  arrange(___) %>%
  tail(1)

flights %>% 
  arrange(desc(___)) %>%
  ___(1)

► Solution:

flights %>% 
  arrange(arr_delay) %>%
  tail(1)
## # A tibble: 1 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>
## 1  2013     9    30       NA            840        NA       NA
## # … with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## #   time_hour <dttm>

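Why doesn’t this give the result we’re looking for? arrange() sorts missing values to the end, so the last row is a flight without a recorded arrival delay. Can we use a filter to drop those flights first?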

flights %>% 
  filter(!is.na(arr_delay)) %>%
  arrange(arr_delay) %>%
  tail(1)
## # A tibble: 1 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>
## 1  2013     1     9      641            900      1301     1242
## # … with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## #   time_hour <dttm>
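
How arrange() places missing values can be reproduced on a small made-up table. This is only a sketch: the toy tibble and its values are assumptions for illustration and are not taken from flights.

```r
library(dplyr)

# Made-up data standing in for flights (hypothetical values).
toy <- tibble(
  flight = c("A", "B", "C", "D"),
  arr_delay = c(12, NA, -5, 300)
)

# Ascending sort: the row with the missing delay ends up last,
# so tail(1) returns it instead of the largest delay.
asc <- toy %>% arrange(arr_delay)
tail(asc, 1)

# Descending sort: the missing value is still placed last,
# so head(1) returns the largest recorded delay.
dsc <- toy %>% arrange(desc(arr_delay))
head(dsc, 1)
```

Removing the missing values with filter(!is.na(arr_delay)) before sorting makes tail(1) safe again, at the cost of an extra step.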

Or the pattern below, which sorts flights with a missing arr_delay first?

flights %>% 
  arrange(!is.na(arr_delay), arr_delay) %>%
  tail(1)
## # A tibble: 1 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>
## 1  2013     1     9      641            900      1301     1242
## # … with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## #   time_hour <dttm>
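
The two-key sort is easier to follow on a few rows. A minimal sketch, again on made-up data (the toy tibble is an assumption, not part of flights): the logical key !is.na(arr_delay) is FALSE for missing delays, and FALSE sorts before TRUE.

```r
library(dplyr)

# Made-up data standing in for flights (hypothetical values).
toy <- tibble(
  flight = c("A", "B", "C", "D"),
  arr_delay = c(12, NA, -5, 300)
)

# First key:  FALSE (delay missing) sorts before TRUE (delay present).
# Second key: ascending arr_delay among the rows with the same first key.
sorted <- toy %>% arrange(!is.na(arr_delay), arr_delay)

sorted
tail(sorted, 1)
```

Because the rows with missing delays are moved to the top, the last row is the flight with the largest recorded delay.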

Usually it’s easiest to sort in descending order and use head(), with either a unary minus or desc():

flights %>% 
  arrange(-arr_delay) %>%
  head(1)
## # A tibble: 1 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>
## 1  2013     1     9      641            900      1301     1242
## # … with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## #   time_hour <dttm>

flights %>% 
  arrange(desc(arr_delay)) %>%
  head(1)
## # A tibble: 1 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>
## 1  2013     1     9      641            900      1301     1242
## # … with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## #   time_hour <dttm>
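
For numeric columns the two spellings sort identically, but they differ for text columns. The sketch below uses a made-up tibble (an assumption for illustration) to show that desc() also works on a character column, whereas unary minus raises an error there.

```r
library(dplyr)

# Made-up data standing in for flights (hypothetical values).
toy <- tibble(
  carrier = c("UA", "AA", "DL"),
  arr_delay = c(5, 20, -3)
)

# Numeric column: both spellings give the same row order.
by_minus <- toy %>% arrange(-arr_delay)
by_desc <- toy %>% arrange(desc(arr_delay))
identical(by_minus$carrier, by_desc$carrier)

# Character column: desc() works ...
by_carrier <- toy %>% arrange(desc(carrier))
by_carrier

# ... but unary minus is not defined for character vectors.
minus_fails <- tryCatch(
  {
    toy %>% arrange(-carrier)
    FALSE
  },
  error = function(e) TRUE
)
minus_fails
```

This is one reason to prefer desc() when the sorting column might not be numeric.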

Flight with longest airtime

On what day did the flight with the longest airtime take place?

flights %>% 
  arrange(desc(___)) %>%
  head(1)

► Solution:

flights %>% 
  arrange(desc(air_time)) %>%
  head(1)
## # A tibble: 1 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>
## 1  2013     3    17     1337           1335         2     1937
## # … with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## #   time_hour <dttm>

UA flights with lowest delay

Find two equivalent ways to select the six "UA" flights with the lowest arrival delay. Which is faster? Why?

Hint: RStudio has shortcuts for swapping the current line with the next or previous line.

flights %>%
  filter(___) %>%
  arrange(___)

flights %>%
  arrange(___) %>%
  filter(___)

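► Solution: Filtering first is faster, because fewer observations need to be sorted. Appending head() to either pipeline keeps only the six flights with the lowest arrival delay.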

flights %>% 
  filter(carrier == "UA") %>%
  arrange(arr_delay)
## # A tibble: 58,665 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     5     2     1947           1949        -2     2209
##  2  2013     5     2     1926           1929        -3     2157
##  3  2013     5     7     2054           2055        -1     2317
##  4  2013     2    26     1335           1335         0     1819
##  5  2013     2    26     1721           1725        -4     1936
##  6  2013     2    28      702            705        -3      924
##  7  2013     5    13     1624           1629        -5     1831
##  8  2013     5     4     1914           1915        -1     2107
##  9  2013    12    27      853            856        -3     1052
## 10  2013     3     1      629            632        -3      844
## # … with 58,655 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

flights %>% 
  arrange(arr_delay) %>%
  filter(carrier == "UA")
## # A tibble: 58,665 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     5     2     1947           1949        -2     2209
##  2  2013     5     2     1926           1929        -3     2157
##  3  2013     5     7     2054           2055        -1     2317
##  4  2013     2    26     1335           1335         0     1819
##  5  2013     2    26     1721           1725        -4     1936
##  6  2013     2    28      702            705        -3      924
##  7  2013     5    13     1624           1629        -5     1831
##  8  2013     5     4     1914           1915        -1     2107
##  9  2013    12    27      853            856        -3     1052
## 10  2013     3     1      629            632        -3      844
## # … with 58,655 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>
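
That both orders select the same rows can be checked on a small example. This is a sketch on made-up data (the toy tibble and its values are assumptions); head() with its default of six rows is appended to keep only the six lowest delays.

```r
library(dplyr)

# Made-up data: 8 "UA" and 8 "AA" flights with distinct delays.
toy <- tibble(
  carrier = rep(c("UA", "AA"), times = 8),
  arr_delay = c(7, -2, 15, -11, 3, 0, 42, 8, -6, -20, 1, 5, -9, 30, 12, -4)
)

# Filter first: only the 8 "UA" rows are sorted.
filter_first <- toy %>%
  filter(carrier == "UA") %>%
  arrange(arr_delay) %>%
  head()

# Arrange first: all 16 rows are sorted, then half are discarded.
arrange_first <- toy %>%
  arrange(arr_delay) %>%
  filter(carrier == "UA") %>%
  head()

identical(filter_first$arr_delay, arrange_first$arr_delay)
```

On the full flights data, wrapping each pipeline in system.time() is one way to compare the running times; the exact numbers depend on the machine.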

Recovering delay

Which flights were best in recovering from delay in the air?

► Solution: Recovering from delay means that the arrival delay is lower than the departure delay, that is, arrival delay minus departure delay is negative. If we arrange by this difference, the most negative values (the largest recoveries) are sorted first, so they are easy to inspect.

flights %>% 
  arrange(arr_delay - dep_delay)
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     6    13     1907           1512       235     2134
##  2  2013     2    26     1000            900        60     1513
##  3  2013     2    23     1226            900       206     1746
##  4  2013     5    13     1917           1900        17     2149
##  5  2013     2    27      924            900        24     1448
##  6  2013     7    14     1917           1829        48     2109
##  7  2013     7    17     2004           1930        34     2224
##  8  2013    12    27     1719           1648        31     1956
##  9  2013     5     2     1947           1949        -2     2209
## 10  2013    11    13     2024           2015         9     2251
## # … with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>
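
Because the printed output hides arr_delay, the recovery itself is not visible above. One possible way to make it explicit, sketched here on made-up data, is to store the difference in its own column with mutate(). The toy tibble and the column name recovery are assumptions for illustration, and mutate() goes beyond the verbs used in this section.

```r
library(dplyr)

# Made-up data standing in for flights (hypothetical values).
toy <- tibble(
  flight = c("A", "B", "C"),
  dep_delay = c(60, 10, -5),
  arr_delay = c(15, 25, -5)
)

# recovery = minutes made up in the air; positive means the flight
# arrived with less delay than it departed with.
recovered <- toy %>%
  mutate(recovery = dep_delay - arr_delay) %>%
  arrange(desc(recovery))

recovered
```

Sorting by desc(dep_delay - arr_delay) gives the same order as sorting by arr_delay - dep_delay in ascending order, as long as there are no ties.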

Copyright © 2019 Kirill Müller. Licensed under CC BY-NC 4.0.