The goal of both the package {dm} and its dm class is to facilitate work with multiple related tables. An object of the dm class contains all relevant information about the tables in a data model:

  1. the place where the tables live (i.e., the src: a database (DB) or locally in your R session)
  2. the meta-data from the data model: the tables, columns, and key constraints
  3. the data in the tables

This package augments {dplyr}/{dbplyr} workflows:

  • multiple related tables are kept in a single compound object,
  • joins across multiple tables are available by stating the tables involved, no need to memorize column names or relationships

In addition, a battery of utilities is provided that helps with creating a tidy data model.

This package follows several of the “tidyverse” rules:

  • dm objects are immutable (your data will never be overwritten in place)
  • many functions used on dm objects are pipeable (i.e., return new dm objects)
  • tidy evaluation is used (unquoted function parameters are supported)

The {dm} package builds heavily upon the {datamodelr} package, and upon the tidyverse. We’re looking forward to a good collaboration!

We will showcase here some of the features of {dm}:

  1. Creation of dm objects
  2. Setting keys and drawing
  3. Filtering
  4. Copying and discovery

Let’s first have a brief look at how to create a dm-class object.

library(dm)
library(tidyverse)

Creating dm objects:

The {nycflights13}-package offers a nice example for interconnected tables. The most straightforward way of squeezing them into a dm object is:

flights_dm <- dm(src_df(pkg = "nycflights13"))
flights_dm
#> ── Table source ───────────────────────────────────────────────────────────
#> src:  <package: nycflights13>
#> ── Data model ─────────────────────────────────────────────────────────────
#> Data model object:
#>   5 tables:  airlines, airports, flights, planes ... 
#>   53 columns
#>   0 primary keys
#>   0 references
#> ── Filters ────────────────────────────────────────────────────────────────
#> None

The fairly verbose output shows the three components of a dm object: the table source, the metadata, and row counts. These components can be accessed with cdm_get_src(), cdm_get_tables() and cdm_get_data_model().

Keys and visualization

As you can see in the “Data model” part of the output above, so far no keys are set. We use cdm_add_pk() and cdm_add_fk() to add primary keys (pk) and foreign keys (fk):

flights_dm_with_one_key <- 
  flights_dm %>% 
  cdm_add_pk(airlines, carrier) %>% 
  cdm_add_fk(flights, carrier, airlines)

Once you set the keys and established relations, you can show a graphical representation of your data model with cdm_draw():

%0 airlines airlinescarrierairports airportsflights flightscarrierflights:carrier->airlines:carrier planes planesweather weather

The cdm_nycflights13() function provides a shortcut: the dm object returned by this function contains all tables, defines all primary and foreign keys and even assigns colors to the different types of tables. We will be using the dm object created by this function from here on.

%0 airlines airlinescarrierairports airportsfaaflights flightscarriertailnumorigindestflights:carrier->airlines:carrier flights:origin->airports:faa flights:dest->airports:faa planes planestailnumflights:tailnum->planes:tailnum weather weather

Filtering a table of a dm object

The idea of a filter on a dm object:

  1. You filter one or more of dm’s tables, just like with normal dplyr::filter() calls
  2. The filter conditions are stored in the dm object, but not immediately executed
  3. Once you access a table, the relevant filter conditions are applied to the respective tables and the necessary joins are performed along the fk relations. What you get out is a table containing only rows related to the filter operations

Currently, this only works if the graph induced by the fk relations is cycle free, the default for cdm_nycflights13():

%0 airlines airlinescarrierairports airportsfaaflights flightscarriertailnumoriginflights:carrier->airlines:carrier flights:origin->airports:faa planes planestailnumflights:tailnum->planes:tailnum weather weather

Let’s set two filters:

us_flights_from_jfk_prepared <- 
  flights_dm_acyclic %>%
  cdm_filter(airports, name == "John F Kennedy Intl") %>% 
  cdm_filter(airlines, name == "US Airways Inc.")
us_flights_from_jfk_prepared
#> ── Table source ───────────────────────────────────────────────────────────
#> src:  <package: nycflights13>
#> ── Data model ─────────────────────────────────────────────────────────────
#> Data model object:
#>   5 tables:  airlines, airports, flights, planes ... 
#>   53 columns
#>   3 primary keys
#>   3 references
#> ── Filters ────────────────────────────────────────────────────────────────
#> airlines: name == "US Airways Inc."
#> airports: name == "John F Kennedy Intl"

With cdm_apply_filters() you can apply all filters and therefore update all tables in the dm, so that they contain only the rows relevant to the filters. Counting the rows you can see the effect of the filters:

Or you can just pull out one of the tables from the dm, answering the question which planes were used to service US Airways flights having started from JFK airport:

#> # A tibble: 224 x 9
#>    tailnum  year type       manufacturer  model  engines seats speed engine
#>    <chr>   <int> <chr>      <chr>         <chr>    <int> <int> <int> <chr> 
#>  1 N102UW   1998 Fixed win… AIRBUS INDUS… A320-…       2   182    NA Turbo…
#>  2 N103US   1999 Fixed win… AIRBUS INDUS… A320-…       2   182    NA Turbo…
#>  3 N104UW   1999 Fixed win… AIRBUS INDUS… A320-…       2   182    NA Turbo…
#>  4 N105UW   1999 Fixed win… AIRBUS INDUS… A320-…       2   182    NA Turbo…
#>  5 N107US   1999 Fixed win… AIRBUS INDUS… A320-…       2   182    NA Turbo…
#>  6 N108UW   1999 Fixed win… AIRBUS INDUS… A320-…       2   182    NA Turbo…
#>  7 N109UW   1999 Fixed win… AIRBUS INDUS… A320-…       2   182    NA Turbo…
#>  8 N110UW   1999 Fixed win… AIRBUS INDUS… A320-…       2   182    NA Turbo…
#>  9 N111US   1999 Fixed win… AIRBUS INDUS… A320-…       2   182    NA Turbo…
#> 10 N112US   1999 Fixed win… AIRBUS INDUS… A320-…       2   182    NA Turbo…
#> # … with 214 more rows

Each of the planes in the result set above was part in at least one US Airways flight departing from JFK. Do they have common characteristics?

#> # A tibble: 8 x 2
#>   model               n
#>   <chr>           <int>
#> 1 A319-112           54
#> 2 A319-132            2
#> 3 A320-214           23
#> 4 A320-231            4
#> 5 A320-232           35
#> 6 A321-211           43
#> 7 A321-231           48
#> 8 ERJ 190-100 IGW    15

For comparison, let’s review the equivalent manual query in {dplyr} syntax:

The {dm} code is leaner, because the fk information is encoded in the object.

Joining two tables

The cdm_join_to_tbl() function joins two immediately related tables in a data model. The definition of the primary and foreign key constraints is used to define the relationship.

flights_dm_with_keys %>%
  cdm_join_to_tbl(airlines, flights, join = left_join)
#> # A tibble: 336,776 x 20
#>     year month   day dep_time sched_dep_time dep_delay arr_time
#>    <int> <int> <int>    <int>          <int>     <dbl>    <int>
#>  1  2013     1     1      517            515         2      830
#>  2  2013     1     1      533            529         4      850
#>  3  2013     1     1      542            540         2      923
#>  4  2013     1     1      544            545        -1     1004
#>  5  2013     1     1      554            600        -6      812
#>  6  2013     1     1      554            558        -4      740
#>  7  2013     1     1      555            600        -5      913
#>  8  2013     1     1      557            600        -3      709
#>  9  2013     1     1      557            600        -3      838
#> 10  2013     1     1      558            600        -2      753
#> # … with 336,766 more rows, and 13 more variables: sched_arr_time <int>,
#> #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, time_hour <dttm>, name <chr>

The same operation using {dplyr} syntax looks like this:

Omitting the by argument leads to a warning

Copy a dm object to a DB or learn from it

dm objects can be transferred from one src to another. The relevant verb is cdm_copy_to(), which will copy both data and key constraints.

As a result, the tables are transferred to the target data source, and the returned data model contains all keys.

#> ── Table source ───────────────────────────────────────────────────────────
#> src:  sqlite 3.29.0 [:memory:]
#> ── Data model ─────────────────────────────────────────────────────────────
#> Data model object:
#>   5 tables:  airlines, airports, flights, planes ... 
#>   53 columns
#>   3 primary keys
#>   4 references
#> ── Filters ────────────────────────────────────────────────────────────────
#> None

In the opposite direction, dm objects can also be “learned” from a DB, including the key constraints, by utilizing the DB’s meta-information tables. Unfortunately, this currently works only for MSSQL and Postgres, we cannot show the results here yet:

Further reading