Find all flights that departed today x years ago, flown by "US"
. Two simple solutions exist, which appeals most to you?
flights %>%
filter(___, ___) %>%
filter(___)
flights %>%
filter(_____)
► Solution:
flights %>%
filter(day == 13, month == 11) %>%
filter(carrier == "US")
## # A tibble: 66 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 11 13 458 500 -2 632 651
## 2 2013 11 13 551 600 -9 638 658
## 3 2013 11 13 553 600 -7 650 659
## 4 2013 11 13 606 615 -9 741 818
## 5 2013 11 13 607 615 -8 758 817
## 6 2013 11 13 622 630 -8 818 833
## 7 2013 11 13 625 630 -5 824 840
## 8 2013 11 13 626 630 -4 915 1018
## 9 2013 11 13 631 600 31 739 701
## 10 2013 11 13 653 700 -7 753 809
## # … with 56 more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
flights %>%
filter(day == 13 & month == 11 & carrier == "US")
## # A tibble: 66 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 11 13 458 500 -2 632 651
## 2 2013 11 13 551 600 -9 638 658
## 3 2013 11 13 553 600 -7 650 659
## 4 2013 11 13 606 615 -9 741 818
## 5 2013 11 13 607 615 -8 758 817
## 6 2013 11 13 622 630 -8 818 833
## 7 2013 11 13 625 630 -5 824 840
## 8 2013 11 13 626 630 -4 915 1018
## 9 2013 11 13 631 600 31 739 701
## 10 2013 11 13 653 700 -7 753 809
## # … with 56 more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
Find all flights that departed before 6:00 AM or after 10:00 PM.
flights %>%
filter(___ | ___)
► Solution:
flights %>%
filter(dep_time < 600 | dep_time > 2200)
## # A tibble: 16,753 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # … with 16,743 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
Find all flights not flown by either "UA"
or "WN"
. Can you think of three different solutions? Which appeals more to you?
flights %>%
filter(___ ___ ___)
flights %>%
filter(!(___) ___ !(___))
flights %>%
filter(!(_____))
► Solution:
flights %>%
filter(carrier != "UA" & carrier != "WN")
## # A tibble: 265,836 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 542 540 2 923 850
## 2 2013 1 1 544 545 -1 1004 1022
## 3 2013 1 1 554 600 -6 812 837
## 4 2013 1 1 555 600 -5 913 854
## 5 2013 1 1 557 600 -3 709 723
## 6 2013 1 1 557 600 -3 838 846
## 7 2013 1 1 558 600 -2 753 745
## 8 2013 1 1 558 600 -2 849 851
## 9 2013 1 1 558 600 -2 853 856
## 10 2013 1 1 559 600 -1 941 910
## # … with 265,826 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
flights %>%
filter(!(carrier == "UA" | carrier == "WN"))
## # A tibble: 265,836 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 542 540 2 923 850
## 2 2013 1 1 544 545 -1 1004 1022
## 3 2013 1 1 554 600 -6 812 837
## 4 2013 1 1 555 600 -5 913 854
## 5 2013 1 1 557 600 -3 709 723
## 6 2013 1 1 557 600 -3 838 846
## 7 2013 1 1 558 600 -2 753 745
## 8 2013 1 1 558 600 -2 849 851
## 9 2013 1 1 558 600 -2 853 856
## 10 2013 1 1 559 600 -1 941 910
## # … with 265,826 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
flights %>%
filter(!(carrier %in% c("UA", "WN")))
## # A tibble: 265,836 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 542 540 2 923 850
## 2 2013 1 1 544 545 -1 1004 1022
## 3 2013 1 1 554 600 -6 812 837
## 4 2013 1 1 555 600 -5 913 854
## 5 2013 1 1 557 600 -3 709 723
## 6 2013 1 1 557 600 -3 838 846
## 7 2013 1 1 558 600 -2 753 745
## 8 2013 1 1 558 600 -2 849 851
## 9 2013 1 1 558 600 -2 853 856
## 10 2013 1 1 559 600 -1 941 910
## # … with 265,826 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
Which flights have a missing departure or arrival time? Which have both missing? Can the number of flights that have a missing arrival but not departure time correspond to lost or crashed flights?
flights %>%
filter(is.na(___))
flights %>%
filter(___(___) ___ _____)
flights %>%
filter(_____ ___ !_____)
► Solution:
flights %>%
filter(is.na(arr_time))
## # A tibble: 8,713 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 2016 1930 46 NA 2220
## 2 2013 1 1 NA 1630 NA NA 1815
## 3 2013 1 1 NA 1935 NA NA 2240
## 4 2013 1 1 NA 1500 NA NA 1825
## 5 2013 1 1 NA 600 NA NA 901
## 6 2013 1 2 2041 2045 -4 NA 2359
## 7 2013 1 2 2145 2129 16 NA 33
## 8 2013 1 2 NA 1540 NA NA 1747
## 9 2013 1 2 NA 1620 NA NA 1746
## 10 2013 1 2 NA 1355 NA NA 1459
## # … with 8,703 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
flights %>%
filter(is.na(arr_time) & is.na(dep_time))
## # A tibble: 8,255 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 NA 1630 NA NA 1815
## 2 2013 1 1 NA 1935 NA NA 2240
## 3 2013 1 1 NA 1500 NA NA 1825
## 4 2013 1 1 NA 600 NA NA 901
## 5 2013 1 2 NA 1540 NA NA 1747
## 6 2013 1 2 NA 1620 NA NA 1746
## 7 2013 1 2 NA 1355 NA NA 1459
## 8 2013 1 2 NA 1420 NA NA 1644
## 9 2013 1 2 NA 1321 NA NA 1536
## 10 2013 1 2 NA 1545 NA NA 1910
## # … with 8,245 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
flights %>%
filter(is.na(arr_time) & !is.na(dep_time))
## # A tibble: 458 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 2016 1930 46 NA 2220
## 2 2013 1 2 2041 2045 -4 NA 2359
## 3 2013 1 2 2145 2129 16 NA 33
## 4 2013 1 9 615 615 0 NA 855
## 5 2013 1 9 2042 2040 2 NA 2357
## 6 2013 1 11 1344 1350 -6 NA 1518
## 7 2013 1 13 1907 1634 153 NA 1837
## 8 2013 1 13 2239 2159 40 NA 30
## 9 2013 1 16 837 840 -3 NA 1030
## 10 2013 1 25 1452 1500 -8 NA 1619
## # … with 448 more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
Find two equivalent ways to select the six "UA"
flights with the lowest delay. Which is faster? Why?
Hint: RStudio has shortcuts for swapping the current line with the next or previous line.
flights %>%
filter(___) %>%
arrange(___)
flights %>%
arrange(___) %>%
filter(___)
► Solution:
Arranging a smaller subset needs less work.
system.time(
flights %>%
arrange(arr_delay) %>%
filter(carrier == "UA")
)
## user system elapsed
## 0.217 0.000 0.216
system.time(
flights %>%
filter(carrier == "UA") %>%
arrange(arr_delay)
)
## user system elapsed
## 0.042 0.000 0.042
Look at the “Details” section in the help page for |
with help("|")
to understand predicate logic in R. (We need element-wise comparisons.)
Find more exercises in items 1 and 4 of Section 5.2.4, and in Section 5.3.1, of r4ds.
Copyright © 2019 Kirill Müller. Licensed under CC BY-NC 4.0.