The goal of both the package dm and the class dm - which comes along with it - is to make your life easier when you are dealing with data models and its corresponding data.

Let’s first have a look at the new class:

Class dm

The dm class contains a set of tables as well as information about their primary keys and the relations between the tables. You can have your dm object on different data sources, such as a local environment or a database.

dm is a wrapper around two classes, which are independent from each other:

  1. part of the object is a src object from {dplyr}
  2. the other part is a class data_model object from the package {datamodelr}

The src object contains the information, where the tables of your data model are physically stored. In the data_model object the meta-information is about your data is kept, among others:

  • what are the primary keys in your schema?
  • how are the tables related to each other, i.e. what are the foreign key relations?

Last but not least, a third part of the object is a named list containing the tables on the src.

So much to the theory, let’s see how to construct such an object and how it looks in R:

Examples of dm objects

Relevant functions for creating dm objects are:

  1. dm()
  2. as_dm()
  3. new_dm()

We can use the tables from the famous package {nycflights13}:

library(dm)
library(dplyr)
flights_dm <- dm(src_df(pkg = "nycflights13"))
flights_dm
#> ── Table source ───────────────────────────────────────────────────────────
#> src:  <package: nycflights13>
#> ── Data model ─────────────────────────────────────────────────────────────
#> Data model object:
#>   5 tables:  airlines, airports, flights, planes ... 
#>   53 columns
#>   0 primary keys
#>   0 references
#> ── Filters ────────────────────────────────────────────────────────────────
#> None

Here we make use of the fact, that the function dm(src, data_model = NULL) includes all available tables on a source in the dm object, should the data_model-argument be left NULL. Thus you can use this for example on a postgres database that you access via src_postgres() (with the appropriate arguments dbname, host, port, …), to produce a dm object with all the tables on the database.

Another way of creating a dm object is calling as_dm() on a list of tbls:

iris_dm <- as_dm(list("iris1" = iris, "iris2" = iris))
iris_dm
#> ── Table source ───────────────────────────────────────────────────────────
#> src:  <environment: 0x3f41128>
#> ── Data model ─────────────────────────────────────────────────────────────
#> Data model object:
#>   2 tables:  iris1, iris2 
#>   10 columns
#>   0 primary keys
#>   0 references
#> ── Filters ────────────────────────────────────────────────────────────────
#> None

And lastly, with new_dm() you can create a dm by providing the three individual parts it consists of:

#> ── Table source ───────────────────────────────────────────────────────────
#> src:  <environment: 0x71071e0>
#> ── Data model ─────────────────────────────────────────────────────────────
#> Data model object:
#>   2 tables:  iris1, iris2 
#>   10 columns
#>   0 primary keys
#>   0 references
#> ── Filters ────────────────────────────────────────────────────────────────
#> None

You saw, that with cdm_get_data_model() we access the data_model part of a dm. Similarly, we can get the list of tables with cdm_get_tables() and the src object with cdm_get_src().

In order to pull a specific table from a dm, use:

tbl(flights_dm, "airports")
#> # A tibble: 1,458 x 8
#>    faa   name                    lat    lon   alt    tz dst   tzone        
#>    <chr> <chr>                 <dbl>  <dbl> <int> <dbl> <chr> <chr>        
#>  1 04G   Lansdowne Airport      41.1  -80.6  1044    -5 A     America/New_…
#>  2 06A   Moton Field Municipa…  32.5  -85.7   264    -6 A     America/Chic…
#>  3 06C   Schaumburg Regional    42.0  -88.1   801    -6 A     America/Chic…
#>  4 06N   Randall Airport        41.4  -74.4   523    -5 A     America/New_…
#>  5 09J   Jekyll Island Airport  31.1  -81.4    11    -5 A     America/New_…
#>  6 0A9   Elizabethton Municip…  36.4  -82.2  1593    -5 A     America/New_…
#>  7 0G6   Williams County Airp…  41.5  -84.5   730    -5 A     America/New_…
#>  8 0G7   Finger Lakes Regiona…  42.9  -76.8   492    -5 A     America/New_…
#>  9 0P2   Shoestring Aviation …  39.8  -76.6  1000    -5 U     America/New_…
#> 10 0S9   Jefferson County Intl  48.1 -123.    108    -8 A     America/Los_…
#> # … with 1,448 more rows

But how can we use {dm}-functions to manage primary keys of the tables in a dm object?

Primary keys of dm objects

Useful functions for managing primary key settings are:

  1. cdm_add_pk()
  2. cdm_has_pk()
  3. cdm_get_pk()
  4. cdm_rm_pk()
  5. cdm_enum_pk_candidates()
  6. cdm_get_all_pks()

As of yet dm objects only support one-column primary keys. If your tables have unique compound keys, maybe consider adding a surrogate key column. If you created a dm object according to the examples in “Examples of dm objects”, your object does not yet have any primary keys set. So let’s add one. By default cdm_add_pk() checks if the column of the table given by the user is a unique key. Since the iris dataset does not have any unique one-column keys, we will use the nycflights13 tables, i.e. flights_dm from above.

cdm_has_pk(flights_dm, airports)
#> [1] FALSE
flights_dm_with_key <- cdm_add_pk(flights_dm, airports, faa)
flights_dm_with_key
#> ── Table source ───────────────────────────────────────────────────────────
#> src:  <package: nycflights13>
#> ── Data model ─────────────────────────────────────────────────────────────
#> Data model object:
#>   5 tables:  airlines, airports, flights, planes ... 
#>   53 columns
#>   1 primary keys
#>   0 references
#> ── Filters ────────────────────────────────────────────────────────────────
#> None

The dm model now has a primary key. Let’s check:

cdm_has_pk(flights_dm_with_key, airports)
#> [1] TRUE

Get the name of the column that is marked as primary key of the table:

cdm_get_pk(flights_dm_with_key, airports)
#> [1] "faa"

Remove a primary key by:

cdm_rm_pk(flights_dm_with_key, airports) %>% 
  cdm_has_pk(airports)
#> [1] FALSE

If you still need to get to know your data better, and it is already available in the form of a dm object, you can use the function cdm_enum_pk_candidates() in order to get information, which columns of a table are unique keys:

cdm_enum_pk_candidates(flights_dm_with_key, airports)
#> # A tibble: 8 x 3
#>   column candidate why                                                     
#>   <chr>  <lgl>     <chr>                                                   
#> 1 faa    TRUE      ""                                                      
#> 2 name   FALSE     has duplicate values: All Airports, Capital City Airpor…
#> 3 lat    FALSE     has duplicate values: 38.88944, 40.63975                
#> 4 lon    TRUE      ""                                                      
#> 5 alt    FALSE     has duplicate values: 0, 1, 3, 4, 5, 6, …               
#> 6 tz     FALSE     has duplicate values: -10, -9, -8, -7, -6, -5, …        
#> 7 dst    FALSE     has duplicate values: A, N, U                           
#> 8 tzone  FALSE     "has duplicate values: \\N, America/Anchorage, America/…

Table flights does not have any one-column primary key candidates:

cdm_enum_pk_candidates(flights_dm_with_key, flights) %>% count(candidate)
#> # A tibble: 1 x 2
#>   candidate     n
#>   <lgl>     <int>
#> 1 FALSE        19

Get an overview over all tables with primary keys, use cdm_get_all_pks():

#> # A tibble: 3 x 2
#>   table    pk_col 
#>   <chr>    <chr>  
#> 1 airlines carrier
#> 2 airports faa    
#> 3 planes   tailnum

We used here the prepared dm object cdm_nycflights13(cycle = TRUE) as an example. This object already has all keys pre-set.

Foreign keys

Useful functions for managing foreign key relations are:

  1. cdm_add_fk()
  2. cdm_has_fk()
  3. cdm_get_fk()
  4. cdm_rm_fk()
  5. cdm_enum_fk_candidates()
  6. cdm_get_all_fks()

Now it gets (even more) interesting: we want to define relations between different tables. With the function cdm_add_fk() you can define, which column of which table points to another table’s column.

When dealing with data models, a foreign key from one table has to always point to a primary key of another table. The primary key of the referred table must be set with cdm_add_pk(). cdm_add_fk() will find the primary key column of the referenced table by itself and make the indicated column of the child table point to it.

flights_dm_with_key %>% cdm_add_fk(flights, origin, airports)
#> ── Table source ───────────────────────────────────────────────────────────
#> src:  <package: nycflights13>
#> ── Data model ─────────────────────────────────────────────────────────────
#> Data model object:
#>   5 tables:  airlines, airports, flights, planes ... 
#>   53 columns
#>   1 primary keys
#>   1 references
#> ── Filters ────────────────────────────────────────────────────────────────
#> None

This will throw an error:

Let’s create a dm object with a foreign key relation to work with later on:

flights_dm_with_fk <- cdm_add_fk(flights_dm_with_key, flights, origin, airports)

What if we try to add another foreign key relation from flights to airports to the object? Column dest might work, since it also contains airport codes:

flights_dm_with_fk %>% cdm_add_fk(flights, dest, airports)
#> ── Table source ───────────────────────────────────────────────────────────
#> src:  <package: nycflights13>
#> ── Data model ─────────────────────────────────────────────────────────────
#> Data model object:
#>   5 tables:  airlines, airports, flights, planes ... 
#>   53 columns
#>   1 primary keys
#>   2 references
#> ── Filters ────────────────────────────────────────────────────────────────
#> None

As you can see, behind the scenes, checks are executed automatically by the functions of dm to prevent steps that do not comply with the logic of data models.

Use cdm_has_fk() for checking if a foreign key exists, which is pointing from one table to another:

If you want to access the name of the column, which acts as a foreign key of one table to another table’s column, use cdm_get_fk():

Remove foreign key relations with cdm_rm_fk() (parameter column = NULL means removal of all relations):

Since the primary keys are defined in the dm object, you don’t need to provide the referenced column name of ref_table. This is always the table’s primary key column.

Another function for getting to know your data better (cf. cdm_enum_pk_candidates() in “Primary keys of dm objects”) is cdm_enum_fk_candidates(). Use it to get an overview over foreign key candidates pointing from one table to another:

cdm_enum_fk_candidates(flights_dm_with_key, weather, airports)
#> # A tibble: 15 x 6
#>    ref_table ref_table_pk table   column    candidate why                  
#>    <chr>     <chr>        <chr>   <chr>     <lgl>     <chr>                
#>  1 airports  faa          weather origin    TRUE      ""                   
#>  2 airports  faa          weather year      FALSE     not a subset of airp…
#>  3 airports  faa          weather month     FALSE     not a subset of airp…
#>  4 airports  faa          weather day       FALSE     not a subset of airp…
#>  5 airports  faa          weather hour      FALSE     not a subset of airp…
#>  6 airports  faa          weather temp      FALSE     not a subset of airp…
#>  7 airports  faa          weather dewp      FALSE     not a subset of airp…
#>  8 airports  faa          weather humid     FALSE     not a subset of airp…
#>  9 airports  faa          weather wind_dir  FALSE     not a subset of airp…
#> 10 airports  faa          weather wind_spe… FALSE     not a subset of airp…
#> 11 airports  faa          weather wind_gust FALSE     not a subset of airp…
#> 12 airports  faa          weather precip    FALSE     not a subset of airp…
#> 13 airports  faa          weather pressure  FALSE     not a subset of airp…
#> 14 airports  faa          weather visib     FALSE     not a subset of airp…
#> 15 airports  faa          weather time_hour FALSE     not a subset of airp…

Get an overview over all foreign key relations in your data model withcdm_get_all_fks():

#> # A tibble: 4 x 3
#>   child_table child_fk_col parent_table
#>   <chr>       <chr>        <chr>       
#> 1 flights     tailnum      planes      
#> 2 flights     carrier      airlines    
#> 3 flights     origin       airports    
#> 4 flights     dest         airports