On what day did the flight with the shortest airtime take place?
Hint: Use head() to restrict your result to one row only.
flights %>%
  arrange(___) %>%
  head(1)
► Solution:
flights %>%
  arrange(air_time) %>%
  head(1)
## # A tibble: 1 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 16 1355 1315 40 1442 1411
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
Which flights had the heaviest delays? Can you use the tail() verb to obtain this information?
flights %>%
  arrange(___) %>%
  tail(1)
flights %>%
  arrange(desc(___)) %>%
  ___(1)
► Solution:
flights %>%
  arrange(arr_delay) %>%
  tail(1)
## # A tibble: 1 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 9 30 NA 840 NA NA 1020
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
Why doesn’t this give the result we’re looking for? Can we use a filter?
flights %>%
  filter(!is.na(arr_delay)) %>%
  arrange(arr_delay) %>%
  tail(1)
## # A tibble: 1 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 9 641 900 1301 1242 1530
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
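The tail() attempt returns a row full of NA because arrange() always sorts missing values to the end. A minimal sketch on a hypothetical four-row tibble (the `toy` data and its flight labels are invented for illustration, not taken from flights) shows both behaviours side by side:

```r
library(dplyr)

# Hypothetical toy data standing in for flights; one delay is missing.
toy <- tibble(
  flight = c("A", "B", "C", "D"),
  arr_delay = c(12, NA, 95, -3)
)

# arrange() puts NA last, so tail(1) picks the row with the missing delay.
last_row <- toy %>%
  arrange(arr_delay) %>%
  tail(1)

# Dropping the NA rows first makes tail(1) return the largest delay.
worst <- toy %>%
  filter(!is.na(arr_delay)) %>%
  arrange(arr_delay) %>%
  tail(1)

last_row$flight  # "B"
worst$flight     # "C"
```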
Or the pattern below?
flights %>%
  arrange(!is.na(arr_delay), arr_delay) %>%
  tail(1)
## # A tibble: 1 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 9 641 900 1301 1242 1530
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
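This pattern works because a logical sort key orders FALSE before TRUE, so rows with a missing delay move to the front and the largest known delay ends up last. The sketch below uses a hypothetical `toy` tibble (invented for illustration) to make the resulting row order visible:

```r
library(dplyr)

# Hypothetical toy data; flight "B" has no recorded arrival delay.
toy <- tibble(
  flight = c("A", "B", "C", "D"),
  arr_delay = c(12, NA, 95, -3)
)

# First key: FALSE (delay missing) sorts before TRUE (delay known).
# Second key: ascending delay among the rows where it is known.
sorted <- toy %>%
  arrange(!is.na(arr_delay), arr_delay)

sorted$flight            # "B" "D" "A" "C"
tail(sorted, 1)$flight   # "C"
```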
Usually it’s easiest to sort in descending order, either with a minus sign or with desc():
flights %>%
  arrange(-arr_delay) %>%
  head(1)
## # A tibble: 1 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 9 641 900 1301 1242 1530
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
flights %>%
  arrange(desc(arr_delay)) %>%
  head(1)
## # A tibble: 1 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 9 641 900 1301 1242 1530
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
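For a numeric column the two spellings give the same order, and missing values still end up last. One practical difference is that desc() also accepts character columns, whereas unary minus is only defined for numbers. A small sketch on a hypothetical `toy` tibble (invented for illustration):

```r
library(dplyr)

# Hypothetical toy data; the "AA" row has no recorded arrival delay.
toy <- tibble(
  carrier = c("UA", "AA", "DL"),
  arr_delay = c(12, NA, 95)
)

# Numeric column: both spellings sort from largest to smallest, NA last.
by_minus <- toy %>% arrange(-arr_delay)
by_desc  <- toy %>% arrange(desc(arr_delay))
identical(by_minus$carrier, by_desc$carrier)

# Character column: desc() sorts from Z to A.
by_carrier <- toy %>% arrange(desc(carrier))
by_carrier$carrier   # "UA" "DL" "AA"
```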
On what day did the flight with the longest airtime take place?
flights %>%
  arrange(desc(___)) %>%
  head(1)
► Solution:
flights %>%
  arrange(desc(air_time)) %>%
  head(1)
## # A tibble: 1 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 3 17 1337 1335 2 1937 1836
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
Find two equivalent ways to select the six "UA" flights with the lowest arrival delay. Which is faster? Why?
Hint: RStudio has shortcuts for swapping the current line with the next or previous line.
flights %>%
  filter(___) %>%
  arrange(___)
flights %>%
  arrange(___) %>%
  filter(___)
► Solution:
Filtering first is faster: arrange() then only has to sort the 58,665 UA flights instead of all 336,776 flights.
flights %>%
  filter(carrier == "UA") %>%
  arrange(arr_delay)
## # A tibble: 58,665 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 5 2 1947 1949 -2 2209 2324
## 2 2013 5 2 1926 1929 -3 2157 2310
## 3 2013 5 7 2054 2055 -1 2317 28
## 4 2013 2 26 1335 1335 0 1819 1929
## 5 2013 2 26 1721 1725 -4 1936 2046
## 6 2013 2 28 702 705 -3 924 1034
## 7 2013 5 13 1624 1629 -5 1831 1941
## 8 2013 5 4 1914 1915 -1 2107 2216
## 9 2013 12 27 853 856 -3 1052 1200
## 10 2013 3 1 629 632 -3 844 952
## # … with 58,655 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
flights %>%
  arrange(arr_delay) %>%
  filter(carrier == "UA")
## # A tibble: 58,665 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 5 2 1947 1949 -2 2209 2324
## 2 2013 5 2 1926 1929 -3 2157 2310
## 3 2013 5 7 2054 2055 -1 2317 28
## 4 2013 2 26 1335 1335 0 1819 1929
## 5 2013 2 26 1721 1725 -4 1936 2046
## 6 2013 2 28 702 705 -3 924 1034
## 7 2013 5 13 1624 1629 -5 1831 1941
## 8 2013 5 4 1914 1915 -1 2107 2216
## 9 2013 12 27 853 856 -3 1052 1200
## 10 2013 3 1 629 632 -3 844 952
## # … with 58,655 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
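To compare the two orderings directly, each pipeline can be wrapped in system.time(), with head(6) appended to keep only the six rows the exercise asks for. The sketch below is an illustration on simulated data: the `toy` tibble, its size, and its carrier codes are assumptions, and the measured times will differ from machine to machine.

```r
library(dplyr)

# Hypothetical stand-in for flights: 200,000 rows, delays without ties.
set.seed(1)
n <- 200000
toy <- tibble(
  carrier = sample(c("UA", "AA", "DL", "B6"), n, replace = TRUE),
  arr_delay = sample(n) - 1000
)

# Variant 1: filter first, then sort only the UA rows.
time_filter_first <- system.time(
  filter_first <- toy %>%
    filter(carrier == "UA") %>%
    arrange(arr_delay) %>%
    head(6)
)

# Variant 2: sort every row, then keep the UA rows.
time_arrange_first <- system.time(
  arrange_first <- toy %>%
    arrange(arr_delay) %>%
    filter(carrier == "UA") %>%
    head(6)
)

# Both variants select the same six rows.
identical(filter_first$arr_delay, arrange_first$arr_delay)

# Elapsed times depend on the machine, so no particular numbers are claimed.
time_filter_first["elapsed"]
time_arrange_first["elapsed"]
```

On data this small the difference may be barely measurable; the point is only that the second variant sorts rows it later throws away.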
Which flights were best at recovering from a delay in the air?
► Solution:
Recovering from a delay means that the arrival delay is lower than the departure delay, that is, arr_delay - dep_delay is negative. Because arrange() sorts in ascending order, arranging by this difference puts the most negative values first, so the flights that made up the most time appear at the top.
flights %>%
  arrange(arr_delay - dep_delay)
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 6 13 1907 1512 235 2134 1928
## 2 2013 2 26 1000 900 60 1513 1540
## 3 2013 2 23 1226 900 206 1746 1540
## 4 2013 5 13 1917 1900 17 2149 2251
## 5 2013 2 27 924 900 24 1448 1540
## 6 2013 7 14 1917 1829 48 2109 2135
## 7 2013 7 17 2004 1930 34 2224 2304
## 8 2013 12 27 1719 1648 31 1956 2038
## 9 2013 5 2 1947 1949 -2 2209 2324
## 10 2013 11 13 2024 2015 9 2251 2354
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
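The ordering is easier to follow on a handful of rows where the differences can be computed by hand. The sketch below uses a hypothetical `toy` tibble (invented for illustration); note that a flight with missing delays sorts last, as with any arrange() call.

```r
library(dplyr)

# Hypothetical toy data; flight "C" has no recorded delays.
toy <- tibble(
  flight = c("A", "B", "C", "D"),
  dep_delay = c(10, 60, NA, -2),
  arr_delay = c(45, 20, NA, -5)
)

# arr_delay - dep_delay: A = 35, B = -40, C = NA, D = -3.
recovered <- toy %>%
  arrange(arr_delay - dep_delay)

recovered$flight   # "B" "D" "A" "C"
```

Flight B departed 60 minutes late and arrived only 20 minutes late, so it recovered 40 minutes and comes first.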
Copyright © 2019 Kirill Müller. Licensed under CC BY-NC 4.0.