The goal of the package {dm} and its dm class is to facilitate working with multiple related tables.

An object of the dm class contains the data in the tables, and metadata about the tables, such as

  • the names of the tables
  • the names of the columns of the tables
  • the key constraints to link the tables together
  • the data (either as data frames or as references to database tables)

This package augments {dplyr}/{dbplyr} workflows:

  • multiple related tables are kept in a single compound object
  • joins across multiple tables are available by specifying the tables involved, without a need to memorize column names or relationships

In addition, a battery of utilities is provided that helps with creating a tidy data model.

This package follows several of the “tidyverse” rules:

  • dm objects are immutable (your data will never be overwritten in place)
  • many functions used on dm objects are pipeable (i.e., return new dm objects)
  • tidy evaluation is used (unquoted function parameters are supported)

The {dm} package builds heavily upon the {datamodelr} package, and upon the tidyverse. We’re looking forward to a great collaboration!

We will now demonstrate some of the features of {dm}:

  1. Creation of dm objects
  2. Setting keys and drawing
  3. Filtering
  4. Copying and discovery

Let’s first have a brief look at how to create a dm-class object.

library(tidyverse)
library(dm)

Creating dm objects:

The {nycflights13} package offers a nice example of interconnected tables. The most straightforward way of squeezing those tables into a dm object is:

flights_dm <- dm_from_src(src_df(pkg = "nycflights13"))
flights_dm
#> ── Table source ───────────────────────────────────────────────────────────
#> src:  <environment: R_GlobalEnv>
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `weather`
#> Columns: 53
#> Primary keys: 0
#> Foreign keys: 0

This fairly verbose output shows the data and metadata of a dm object. These components can be accessed with dm_get_src() and dm_get_tables().

#> # A tibble: 0 x 3
#> # … with 3 variables: table <chr>, filter <list>, zoomed <lgl>
dm_get_all_pks(flights_dm)
#> # A tibble: 0 x 2
#> # … with 2 variables: table <chr>, pk_col <list<chr>>
dm_get_all_fks(flights_dm)
#> # A tibble: 0 x 3
#> # … with 3 variables: child_table <chr>, child_fk_cols <list<chr>>,
#> #   parent_table <chr>

Keys and visualization

As you can see in the output above, no keys have been set so far. We will use dm_add_pk() and dm_add_fk() to add primary keys (pk) and foreign keys (fk):

flights_dm_with_one_key <- 
  flights_dm %>% 
  dm_add_pk(airlines, carrier) %>% 
  dm_add_fk(flights, carrier, airlines)

After you set the keys and establish relations, you can create a graphical representation of your data model with dm_draw():

%0 airlines airlinescarrierairports airportsflights flightscarrierflights:carrier->airlines:carrier planes planesweather weather

The dm_nycflights13() function provides a shortcut: the dm object returned by this function contains all tables, defines all primary and foreign keys, and even assigns colors to the different types of tables. We will be using the dm object created by this function from now on.

%0 airlines airlinescarrierairports airportsfaaflights flightscarriertailnumorigindestflights:carrier->airlines:carrier flights:origin->airports:faa flights:dest->airports:faa planes planestailnumflights:tailnum->planes:tailnum weather weather

Filtering a table of a dm object

The idea of a filter on a dm object:

  1. You can filter one or more of dm’s tables, just like with normal dplyr::filter() calls
  2. Filtering conditions are immediately executed for the table in question and additionally stored in the dm object
  3. If you access a table via dm_apply_filters_to_tbl(), a sequence of semi_join() calls is performed to retrieve the requested table with only those values in the key columns which correspond to the remaining values in the filtered tables

The function dm_apply_filters() essentially calls dm_apply_filters_to_tbl() for each table of the dm and creates a new dm object from the result.

Currently, this only works if the graph induced by the fk relations is cycle free, the default for dm_nycflights13():

%0 airlines airlinescarrierairports airportsfaaflights flightscarriertailnumoriginflights:carrier->airlines:carrier flights:origin->airports:faa planes planestailnumflights:tailnum->planes:tailnum weather weather

Let’s set two filters:

us_flights_from_jfk_prepared <- 
  flights_dm_acyclic %>%
  dm_filter(airports, name == "John F Kennedy Intl") %>% 
  dm_filter(airlines, name == "US Airways Inc.")
us_flights_from_jfk_prepared
#> ── Table source ───────────────────────────────────────────────────────────
#> src:  <environment: R_GlobalEnv>
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `weather`
#> Columns: 53
#> Primary keys: 3
#> Foreign keys: 3
#> ── Filters ────────────────────────────────────────────────────────────────
#> airlines: name == "US Airways Inc."
#> airports: name == "John F Kennedy Intl"

With dm_apply_filters(), you can apply all filters and therefore update all tables in the dm, so that they contain only the rows that are relevant to the filters. The effect of the filters can be seen when counting the rows:

Alternatively, you can just pull out one of the tables from dm to answer the question which planes were used to service the US Airways flights that departed from JFK airport:

dm_apply_filters_to_tbl(us_flights_from_jfk, "planes")
#> # A tibble: 224 x 9
#>    tailnum  year type       manufacturer  model  engines seats speed engine
#>    <chr>   <int> <chr>      <chr>         <chr>    <int> <int> <int> <chr> 
#>  1 N102UW   1998 Fixed win… AIRBUS INDUS… A320-…       2   182    NA Turbo…
#>  2 N103US   1999 Fixed win… AIRBUS INDUS… A320-…       2   182    NA Turbo…
#>  3 N104UW   1999 Fixed win… AIRBUS INDUS… A320-…       2   182    NA Turbo…
#>  4 N105UW   1999 Fixed win… AIRBUS INDUS… A320-…       2   182    NA Turbo…
#>  5 N107US   1999 Fixed win… AIRBUS INDUS… A320-…       2   182    NA Turbo…
#>  6 N108UW   1999 Fixed win… AIRBUS INDUS… A320-…       2   182    NA Turbo…
#>  7 N109UW   1999 Fixed win… AIRBUS INDUS… A320-…       2   182    NA Turbo…
#>  8 N110UW   1999 Fixed win… AIRBUS INDUS… A320-…       2   182    NA Turbo…
#>  9 N111US   1999 Fixed win… AIRBUS INDUS… A320-…       2   182    NA Turbo…
#> 10 N112US   1999 Fixed win… AIRBUS INDUS… A320-…       2   182    NA Turbo…
#> # … with 214 more rows

Each of the planes in the result set above was a part of at least one US Airways flight departing from JFK. Do they have any common characteristics?

dm_apply_filters_to_tbl(us_flights_from_jfk, "planes") %>% 
  count(model)
#> # A tibble: 8 x 2
#>   model               n
#>   <chr>           <int>
#> 1 A319-112           54
#> 2 A319-132            2
#> 3 A320-214           23
#> 4 A320-231            4
#> 5 A320-232           35
#> 6 A321-211           43
#> 7 A321-231           48
#> 8 ERJ 190-100 IGW    15

For comparison, let’s look at the equivalent manual query in {dplyr} syntax:

flights %>% 
  left_join(airports, by = c("origin" = "faa")) %>% 
  filter(name == "John F Kennedy Intl") %>%
  left_join(airlines, by = "carrier") %>% 
  filter(name.y == "US Airways Inc.") %>%
  semi_join(planes, ., by = "tailnum") %>% 
  count(model)

The {dm} code is leaner because the fk information is encoded in the object.

Mind, that if you access a table via tbl.dm(), $.dm() or [[.dm(), filter conditions set for other tables are ignored.

Joining two tables

The dm_join_to_tbl() function joins two immediately related tables in a data model. The definition of the primary and foreign key constraints is used to define the relationship.

flights_dm_with_keys %>%
  dm_join_to_tbl(airlines, flights, join = left_join)
#> # A tibble: 336,776 x 20
#>     year month   day dep_time sched_dep_time dep_delay arr_time
#>    <int> <int> <int>    <int>          <int>     <dbl>    <int>
#>  1  2013     1     1      517            515         2      830
#>  2  2013     1     1      533            529         4      850
#>  3  2013     1     1      542            540         2      923
#>  4  2013     1     1      544            545        -1     1004
#>  5  2013     1     1      554            600        -6      812
#>  6  2013     1     1      554            558        -4      740
#>  7  2013     1     1      555            600        -5      913
#>  8  2013     1     1      557            600        -3      709
#>  9  2013     1     1      557            600        -3      838
#> 10  2013     1     1      558            600        -2      753
#> # … with 336,766 more rows, and 13 more variables: sched_arr_time <int>,
#> #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, time_hour <dttm>, name <chr>

The same operation using {dplyr} syntax looks like this:

library(nycflights13)
airlines %>% 
  left_join(flights, by = "carrier")

Omitting the by argument leads to a warning.

Copy a dm object to a DB or learn from it

dm objects can be transferred from one src to another. The relevant verb is copy_dm_to(), which will copy both data and key constraints.

As a result, the tables are transferred to the target data source, and all keys will be contained in the returned data model.

#> ── Table source ───────────────────────────────────────────────────────────
#> src:  sqlite 3.30.1 [:memory:]
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `weather`
#> Columns: 53
#> Primary keys: 3
#> Foreign keys: 4

In the opposite direction, dm objects can also be “learned” from a DB, including the key constraints, by utilizing the DB’s meta-information tables. Unfortunately, this currently only works for MSSQL and Postgres, so we cannot show the results here just yet:

Further reading