vignettes/tech-dm-join.Rmd
tech-dm-join.Rmd
The {dm} package offers functions to work with relational data models in R. A common task for multiple, separated tables that have a shared attribute is merging the data.
This document introduces you to the joining functions of {dm} and shows how to apply them using data from the {nycflights13} package.
Relational data models consist of multiple tables that are linked with foreign keys. They are the building blocks for joining tables. Read more about relational data models in the vignette “Introduction to Relational Data Models”.
First, we load the packages that we need:
To explore filtering with {dm}, we’ll use the {nycflights13} data with its tables flights
, planes
, airlines
and airports
.
This dataset contains information about the 336 776 flights that departed from New York City in 2013, with 3322 different planes and 1458 airports involved. The data comes from the US Bureau of Transportation Statistics, and is documented in ?nycflights13
.
First, we have to create a dm
object from the {nycflights13} data. This is implemented with dm::dm_nycflights13()
.
A data model object contains the data as well as metadata.
If you would like to create a dm
from other tables, please look at ?dm
and the function new_dm()
.
dm <- dm_nycflights13()
dm
object{dm} allows you to join two tables of a dm
object based on a shared column. You can use all join functions that you know from the {dplyr} package. Currently {dplyr} supports four types of mutating joins, two types of filtering joins, and a nesting join. See ?dplyr::join
for details.
A join is the combination of two tables based on shared information. In technical terms, we merge the tables that need to be directly connected by a foreign key relation.
The existing links can be inspected in two ways:
dm_draw()
dm %>%
dm_draw()
The directed arrows show explicitly the relation between different columns.
dm_get_all_fks()
dm %>%
dm_get_all_fks()
#> # A tibble: 3 x 3
#> child_table child_fk_cols parent_table
#> <chr> <keys> <chr>
#> 1 flights carrier airlines
#> 2 flights origin airports
#> 3 flights tailnum planes
Let’s look at some examples:
Add a column with airline names from the airlines
table to the flights
table.
dm_joined <- dm %>%
dm_join_to_tbl(flights, airlines, join = left_join)
dm_joined
#> # A tibble: 11,227 x 20
#> year month day dep_time sched_dep_time dep_delay arr_time
#> <int> <int> <int> <int> <int> <dbl> <int>
#> 1 2013 1 10 3 2359 4 426
#> 2 2013 1 10 16 2359 17 447
#> 3 2013 1 10 450 500 -10 634
#> 4 2013 1 10 520 525 -5 813
#> 5 2013 1 10 530 530 0 824
#> 6 2013 1 10 531 540 -9 832
#> 7 2013 1 10 535 540 -5 1015
#> 8 2013 1 10 546 600 -14 645
#> 9 2013 1 10 549 600 -11 652
#> 10 2013 1 10 550 600 -10 649
#> # … with 11,217 more rows, and 13 more variables: sched_arr_time <int>,
#> # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#> # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> # minute <dbl>, time_hour <dttm>, name <chr>
As you can see below, the dm_joined
dataframe has one more column than the flights
table. The difference is the name
column from the airlines
table.
dm %>%
tbl("flights") %>%
names()
#> [1] "year" "month" "day" "dep_time"
#> [5] "sched_dep_time" "dep_delay" "arr_time" "sched_arr_time"
#> [9] "arr_delay" "carrier" "flight" "tailnum"
#> [13] "origin" "dest" "air_time" "distance"
#> [17] "hour" "minute" "time_hour"
dm %>%
tbl("airlines") %>%
names()
#> [1] "carrier" "name"
dm_joined %>%
names()
#> [1] "year" "month" "day" "dep_time"
#> [5] "sched_dep_time" "dep_delay" "arr_time" "sched_arr_time"
#> [9] "arr_delay" "carrier" "flight" "tailnum"
#> [13] "origin" "dest" "air_time" "distance"
#> [17] "hour" "minute" "time_hour" "name"
The result is not a dm
object anymore, but a conventional dataframe:
dm_joined %>%
class()
#> [1] "tbl_df" "tbl" "data.frame"
Another example:
Get all flights that can’t be matched with airlines names.
We expect the flights data from {nycflights13} package to be clean and well organized, so no flights should remain. You can check this with an anti_join
:
dm %>%
dm_join_to_tbl(flights, airlines, join = anti_join)
#> # A tibble: 0 x 19
#> # … with 19 variables: year <int>, month <int>, day <int>, dep_time <int>,
#> # sched_dep_time <int>, dep_delay <dbl>, arr_time <int>,
#> # sched_arr_time <int>, arr_delay <dbl>, carrier <chr>, flight <int>,
#> # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#> # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
An example with filtering on a dm
and then merging:
Get all flights from Delta Air Lines which didn’t depart from John F. Kennedy International Airport in May 2013 - and join all the airports data into the flights
table.
Currently, it is important to call dm_apply_filters()
after piping your conditions. Only then the underlying tables and key relations are updated and you can perform a join on the filtered data. We are working towards removing this inconvenience #62.
dm_nycflights13() %>%
dm_filter(airlines, name == "Delta Air Lines Inc.") %>%
dm_filter(flights, month == 5) %>%
dm_apply_filters() %>%
dm_join_to_tbl(flights, airports, join = left_join)
#> # A tibble: 136 x 26
#> year month day dep_time sched_dep_time dep_delay arr_time
#> <int> <int> <int> <int> <int> <dbl> <int>
#> 1 2013 5 10 554 600 -6 739
#> 2 2013 5 10 556 600 -4 825
#> 3 2013 5 10 606 610 -4 743
#> 4 2013 5 10 625 630 -5 843
#> 5 2013 5 10 632 635 -3 847
#> 6 2013 5 10 653 700 -7 923
#> 7 2013 5 10 654 700 -6 1001
#> 8 2013 5 10 656 700 -4 1008
#> 9 2013 5 10 656 700 -4 911
#> 10 2013 5 10 657 700 -3 1006
#> # … with 126 more rows, and 19 more variables: sched_arr_time <int>,
#> # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#> # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> # minute <dbl>, time_hour <dttm>, name <chr>, lat <dbl>, lon <dbl>,
#> # alt <dbl>, tz <dbl>, dst <chr>, tzone <chr>
A last example:
Merge all tables into one big table.
Sometimes you need everything in one place. In this case you can use the dm_flatten_to_tbl()
function. It joins all the tables of your dm
object together into one wide table. All you have to do is to specify the starting table. The following joins are determined by the foreign key links.
dm_nycflights13() %>%
dm_select_tbl(-weather) %>%
dm_flatten_to_tbl(start = flights)
#> Renamed columns:
#> * year -> flights.year, planes.year
#> * name -> airlines.name, airports.name
#> # A tibble: 11,227 x 35
#> flights.year month day dep_time sched_dep_time dep_delay arr_time
#> <int> <int> <int> <int> <int> <dbl> <int>
#> 1 2013 1 10 3 2359 4 426
#> 2 2013 1 10 16 2359 17 447
#> 3 2013 1 10 450 500 -10 634
#> 4 2013 1 10 520 525 -5 813
#> 5 2013 1 10 530 530 0 824
#> 6 2013 1 10 531 540 -9 832
#> 7 2013 1 10 535 540 -5 1015
#> 8 2013 1 10 546 600 -14 645
#> 9 2013 1 10 549 600 -11 652
#> 10 2013 1 10 550 600 -10 649
#> # … with 11,217 more rows, and 28 more variables: sched_arr_time <int>,
#> # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#> # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> # minute <dbl>, time_hour <dttm>, airlines.name <chr>,
#> # airports.name <chr>, lat <dbl>, lon <dbl>, alt <dbl>, tz <dbl>,
#> # dst <chr>, tzone <chr>, planes.year <int>, type <chr>,
#> # manufacturer <chr>, model <chr>, engines <int>, seats <int>,
#> # speed <int>, engine <chr>
Be aware that all column names need to be unique. The dm_flatten_to_tbl
cares about automatically renaming the relevant columns and prints if something was changed, e.g. name -> airlines.name
.