On what day did the flight with the shortest airtime take place?
Hint: Use head() to restrict your result to one row only.
flights %>%
  arrange(___) %>%
  head(1)
► Solution:
flights %>%
  arrange(air_time) %>%
  head(1)
## # A tibble: 1 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 16 1355 1315 40 1442
## # … with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <dttm>
Which flights had the heaviest delays? Can you use the tail() verb to obtain this information?
flights %>%
  arrange(___) %>%
  tail(1)
flights %>%
  arrange(desc(___)) %>%
  ___(1)
► Solution:
flights %>%
  arrange(arr_delay) %>%
  tail(1)
## # A tibble: 1 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 9 30 NA 840 NA NA
## # … with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <dttm>
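Why doesn’t this give the result we’re looking for? arrange() always sorts missing values to the end, so tail(1) returns a flight whose arr_delay is NA. Can we use a filter?
flights %>%
  filter(!is.na(arr_delay)) %>%
  arrange(arr_delay) %>%
  tail(1)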
## # A tibble: 1 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 9 641 900 1301 1242
## # … with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <dttm>
Or the pattern below, which sorts the flights with a missing arr_delay to the front?
flights %>%
  arrange(!is.na(arr_delay), arr_delay) %>%
  tail(1)
## # A tibble: 1 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 9 641 900 1301 1242
## # … with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <dttm>
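The two results above hinge on where arrange() puts missing values. The sketch below uses a small made-up tibble (`toy` and its values are hypothetical, not taken from flights) to illustrate that NA lands at the end of an ascending sort, and that sorting by !is.na() first moves those rows to the front, so that tail(1) returns the largest observed value.

```r
library(dplyr)

# Hypothetical toy data: three observed delays and one missing value.
toy <- tibble(id = 1:4, delay = c(10, NA, 250, -5))

# Ascending sort: NA goes last, so tail(1) picks the row with the missing delay.
plain <- toy %>%
  arrange(delay) %>%
  tail(1)

# FALSE sorts before TRUE, so rows with a missing delay move to the front
# and the largest observed delay ends up in the last row.
na_first <- toy %>%
  arrange(!is.na(delay), delay) %>%
  tail(1)

print(plain)
print(na_first)
```

The filter() variant drops the incomplete rows entirely, while the !is.na() variant keeps them in the result and only changes their position.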
Usually it’s easiest to sort in descending order:
flights %>%
  arrange(-arr_delay) %>%
  head(1)
## # A tibble: 1 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 9 641 900 1301 1242
## # … with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <dttm>
flights %>%
  arrange(desc(arr_delay)) %>%
  head(1)
## # A tibble: 1 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 9 641 900 1301 1242
## # … with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <dttm>
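Both spellings above sort in descending order, but they are not interchangeable for every column type. A minimal sketch on a hypothetical toy tibble (the values are made up): unary minus needs a numeric column, while desc() also handles character columns such as carrier.

```r
library(dplyr)

# Hypothetical toy data, not taken from flights.
toy <- tibble(
  carrier = c("AA", "UA", "DL"),
  delay   = c(12, NA, 300)
)

# For a numeric column, both forms give the same order; NA still sorts last.
by_minus <- toy %>% arrange(-delay)
by_desc  <- toy %>% arrange(desc(delay))

# desc() also works on character columns, where unary minus would fail.
by_carrier <- toy %>% arrange(desc(carrier))

print(by_desc)
print(by_carrier)
```

Because missing values stay at the end in both directions, head(1) after a descending sort returns the largest observed value without any extra filtering.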
On what day did the flight with the longest airtime take place?
flights %>%
  arrange(desc(___)) %>%
  head(1)
► Solution:
flights %>%
  arrange(desc(air_time)) %>%
  head(1)
## # A tibble: 1 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 3 17 1337 1335 2 1937
## # … with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <dttm>
Find two equivalent ways to select the six "UA" flights with the lowest delay. Which is faster? Why?
Hint: RStudio has shortcuts for swapping the current line with the next or previous line.
flights %>%
  filter(___) %>%
  arrange(___)
flights %>%
  arrange(___) %>%
  filter(___)
► Solution:
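If we filter first, fewer observations need to be sorted: arrange() then handles only the 58,665 "UA" flights instead of all 336,776. Both pipelines return the same rows; append head(6) to keep only the six flights asked for.
flights %>%
  filter(carrier == "UA") %>%
  arrange(arr_delay)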
## # A tibble: 58,665 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 5 2 1947 1949 -2 2209
## 2 2013 5 2 1926 1929 -3 2157
## 3 2013 5 7 2054 2055 -1 2317
## 4 2013 2 26 1335 1335 0 1819
## 5 2013 2 26 1721 1725 -4 1936
## 6 2013 2 28 702 705 -3 924
## 7 2013 5 13 1624 1629 -5 1831
## 8 2013 5 4 1914 1915 -1 2107
## 9 2013 12 27 853 856 -3 1052
## 10 2013 3 1 629 632 -3 844
## # … with 58,655 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
flights %>%
  arrange(arr_delay) %>%
  filter(carrier == "UA")
## # A tibble: 58,665 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 5 2 1947 1949 -2 2209
## 2 2013 5 2 1926 1929 -3 2157
## 3 2013 5 7 2054 2055 -1 2317
## 4 2013 2 26 1335 1335 0 1819
## 5 2013 2 26 1721 1725 -4 1936
## 6 2013 2 28 702 705 -3 924
## 7 2013 5 13 1624 1629 -5 1831
## 8 2013 5 4 1914 1915 -1 2107
## 9 2013 12 27 853 856 -3 1052
## 10 2013 3 1 629 632 -3 844
## # … with 58,655 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
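To check the claim about speed, the two orderings can be timed with system.time(). The sketch below uses synthetic data rather than flights (the tibble `sim`, its size, and the carrier mix are assumptions chosen for illustration). Timings depend on the machine, so no numbers are shown here.

```r
library(dplyr)

set.seed(1)
n <- 1e6

# Synthetic stand-in for flights: unique delays avoid ties in the sort.
sim <- tibble(
  carrier   = sample(c("UA", "AA", "DL", "B6"), n, replace = TRUE),
  arr_delay = sample(n)
)

# Filter first: only the "UA" rows are sorted.
t_filter_first <- system.time(
  a <- sim %>%
    filter(carrier == "UA") %>%
    arrange(arr_delay) %>%
    head(6)
)

# Arrange first: all rows are sorted, then most of them are discarded.
t_arrange_first <- system.time(
  b <- sim %>%
    arrange(arr_delay) %>%
    filter(carrier == "UA") %>%
    head(6)
)

print(t_filter_first)
print(t_arrange_first)
```

Both pipelines should return the same six rows; only the amount of data passed to arrange() differs.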
Which flights were best in recovering from delay in the air?
► Solution:
Recovering from delay means that the arrival delay is lower than the departure delay, that is, arr_delay - dep_delay is negative. Arranging by this difference sorts the most negative values first, so the flights that made up the most time in the air appear at the top.
flights %>%
  arrange(arr_delay - dep_delay)
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 6 13 1907 1512 235 2134
## 2 2013 2 26 1000 900 60 1513
## 3 2013 2 23 1226 900 206 1746
## 4 2013 5 13 1917 1900 17 2149
## 5 2013 2 27 924 900 24 1448
## 6 2013 7 14 1917 1829 48 2109
## 7 2013 7 17 2004 1930 34 2224
## 8 2013 12 27 1719 1648 31 1956
## 9 2013 5 2 1947 1949 -2 2209
## 10 2013 11 13 2024 2015 9 2251
## # … with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
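The printed output above hides arr_delay, so the sort key is not visible. A minimal sketch on a hypothetical toy tibble (values made up) that stores the difference in a helper column, here called `gain`, before sorting. It uses mutate(), which is an assumption about the verbs available at this point in the material.

```r
library(dplyr)

# Hypothetical toy data, not taken from flights.
toy <- tibble(
  flight    = c(101, 102, 103),
  dep_delay = c(60, -2, 15),
  arr_delay = c(10, 5, 15)
)

# Negative gain means the flight arrived with less delay than it departed with.
recovered <- toy %>%
  mutate(gain = arr_delay - dep_delay) %>%
  arrange(gain)

print(recovered)
```

The same pattern applied to flights would make the recovered minutes directly readable in the result.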
Copyright © 2019 Kirill Müller. Licensed under CC BY-NC 4.0.