The goal of the {dm} package and the dm class that comes with it, is to make your life easier when you are dealing with data from several different tables.

Let’s take a look at the dm class.

Class dm

The dm class consists of a collection of tables and metadata about the tables, such as

  • the names of the tables
  • the names of the columns of the tables
  • the primary and foreign keys of the tables to link the tables together
  • the data (either as data frames or as references to database tables)

All tables in a dm must be obtained from the same data source; csv files and spreadsheets would need to be imported to data frames in R.

Examples of dm objects

The relevant functions for creating dm objects are:

  1. dm()
  2. as_dm()
  3. new_dm()

There are currently three options available for creating a dm object. To illustrate these options, we will now create the same dm in three different ways. We can use the tables from the well-known {nycflights13} package.

Start with an empty dm

Start with an empty dm object that has been created with dm() or new_dm(), and add tables to that object:

#> ── Table source ───────────────────────────────────────────────────────────
#> src:  <environment: R_GlobalEnv>
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: 
#> Columns: 0
#> Primary keys: 0
#> Foreign keys: 0
dm_add_tbl(empty_dm, airlines, airports, flights, planes, weather) 
#> ── Table source ───────────────────────────────────────────────────────────
#> src:  <environment: R_GlobalEnv>
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `weather`
#> Columns: 53
#> Primary keys: 0
#> Foreign keys: 0

Coerce a list of tables

Turn a named list of tables into a dm with as_dm():

#> ── Table source ───────────────────────────────────────────────────────────
#> src:  <environment: R_GlobalEnv>
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `weather`
#> Columns: 53
#> Primary keys: 0
#> Foreign keys: 0

Turn tables from a src into a dm

Squeeze all (or a subset of) tables belonging to a src object into a dm using dm():

library(dplyr)
flights_dm <- dm_from_src(src_df(pkg = "nycflights13"))
flights_dm
#> ── Table source ───────────────────────────────────────────────────────────
#> src:  <environment: R_GlobalEnv>
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `weather`
#> Columns: 53
#> Primary keys: 0
#> Foreign keys: 0

Here we are making use of the fact that the function dm(src, table_names = NULL) includes all available tables on a source in the dm object, should the table_names-argument be left NULL. This means that you can use this, for example, on a postgres database that you access via src_postgres() (with the appropriate arguments dbname, host, port, …), to produce a dm object with all the tables on the database.

Another way of creating a dm object is calling new_dm() on a list of tbls, e.g., the tables included in the well-known iris dataset:

iris_dm <- new_dm(list("iris1" = iris, "iris2" = iris))
iris_dm
#> ── Table source ───────────────────────────────────────────────────────────
#> src:  <environment: R_GlobalEnv>
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `iris1`, `iris2`
#> Columns: 10
#> Primary keys: 0
#> Foreign keys: 0

We can get the list of tables with dm_get_tables() and the src object with dm_get_src().

In order to pull a specific table from a dm, use:

tbl(flights_dm, "airports")
#> # A tibble: 1,458 x 8
#>    faa   name                    lat    lon   alt    tz dst   tzone        
#>    <chr> <chr>                 <dbl>  <dbl> <dbl> <dbl> <chr> <chr>        
#>  1 04G   Lansdowne Airport      41.1  -80.6  1044    -5 A     America/New_…
#>  2 06A   Moton Field Municipa…  32.5  -85.7   264    -6 A     America/Chic…
#>  3 06C   Schaumburg Regional    42.0  -88.1   801    -6 A     America/Chic…
#>  4 06N   Randall Airport        41.4  -74.4   523    -5 A     America/New_…
#>  5 09J   Jekyll Island Airport  31.1  -81.4    11    -5 A     America/New_…
#>  6 0A9   Elizabethton Municip…  36.4  -82.2  1593    -5 A     America/New_…
#>  7 0G6   Williams County Airp…  41.5  -84.5   730    -5 A     America/New_…
#>  8 0G7   Finger Lakes Regiona…  42.9  -76.8   492    -5 A     America/New_…
#>  9 0P2   Shoestring Aviation …  39.8  -76.6  1000    -5 U     America/New_…
#> 10 0S9   Jefferson County Intl  48.1 -123.    108    -8 A     America/Los_…
#> # … with 1,448 more rows

But how can we use {dm}-functions to manage the primary keys of the tables in a dm object?

Primary keys of dm objects

Some useful functions for managing primary key settings are:

  1. dm_add_pk()
  2. dm_has_pk()
  3. dm_get_pk()
  4. dm_rm_pk()
  5. dm_enum_pk_candidates()
  6. dm_get_all_pks()

Currently dm objects only support one-column primary keys. If your tables have unique compound keys, adding a surrogate key column might be helpful. If you created a dm object according to the examples in “Examples of dm objects”, your object does not yet have any primary keys set. So let’s add one.

dm_add_pk() has an option to check if the column of the table given by the user is a unique key; for performance reasons, the check will not be executed unless requested. Since the iris dataset does not have any unique one-column keys, we will use the nycflights13 tables, i.e. flights_dm from above.

dm_has_pk(flights_dm, airports)
#> [1] FALSE
flights_dm_with_key <- dm_add_pk(flights_dm, airports, faa)
flights_dm_with_key
#> ── Table source ───────────────────────────────────────────────────────────
#> src:  <environment: R_GlobalEnv>
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `weather`
#> Columns: 53
#> Primary keys: 1
#> Foreign keys: 0

The dm now has a primary key. Let’s check:

dm_has_pk(flights_dm_with_key, airports)
#> [1] TRUE

Get the name of the column that is marked as primary key of the table:

Remove a primary key:

dm_rm_pk(flights_dm_with_key, airports) %>% 
  dm_has_pk(airports)
#> [1] FALSE

If you still need to get to know your data better, and it is already available in the form of a dm object, you can use the dm_enum_pk_candidates() function in order to get information about which columns of the table are unique keys:

dm_enum_pk_candidates(flights_dm_with_key, airports)
#> # A tibble: 8 x 3
#>      columns candidate why                                                 
#>   <list<chr> <lgl>     <chr>                                               
#> 1        [1] TRUE      ""                                                  
#> 2        [1] TRUE      ""                                                  
#> 3        [1] FALSE     "has duplicate values: 0, 1, 3, 4, 5, … (7 total)"  
#> 4        [1] FALSE     "has duplicate values: A, N, U"                     
#> 5        [1] FALSE     "has duplicate values: 38.88944, 40.63975"          
#> 6        [1] FALSE     "has duplicate values: All Airports, Capital City A…
#> 7        [1] FALSE     "has duplicate values: -10, -9, -8, -7, -6, … (7 to…
#> 8        [1] FALSE     "has duplicate values: America/Anchorage, America/C…

The flights table does not have any one-column primary key candidates:

dm_enum_pk_candidates(flights_dm_with_key, flights) %>% count(candidate)
#> # A tibble: 1 x 2
#>   candidate     n
#>   <lgl>     <int>
#> 1 FALSE        19

To get an overview over all tables with primary keys, use dm_get_all_pks():

#> # A tibble: 3 x 2
#>   table         pk_col
#>   <chr>    <list<chr>>
#> 1 airlines         [1]
#> 2 airports         [1]
#> 3 planes           [1]

Here we used the prepared dm object dm_nycflights13(cycle = TRUE) as an example. This object already has all keys pre-set.

Foreign keys

Useful functions for managing foreign key relations include:

  1. dm_add_fk()
  2. dm_has_fk()
  3. dm_get_fk()
  4. dm_rm_fk()
  5. dm_enum_fk_candidates()
  6. dm_get_all_fks()

Now it gets (even more) interesting: we want to define relations between different tables. With the dm_add_fk() function you can define which column of which table points to another table’s column.

This is done by choosing a foreign key from one table that will point to a primary key of another table. The primary key of the referred table must be set with dm_add_pk(). dm_add_fk() will find the primary key column of the referenced table by itself and make the indicated column of the child table point to it.

flights_dm_with_key %>% dm_add_fk(flights, origin, airports)
#> ── Table source ───────────────────────────────────────────────────────────
#> src:  <environment: R_GlobalEnv>
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `weather`
#> Columns: 53
#> Primary keys: 1
#> Foreign keys: 1

This will throw an error:

Let’s create a dm object with a foreign key relation to work with later on:

flights_dm_with_fk <- dm_add_fk(flights_dm_with_key, flights, origin, airports)

What if we tried to add another foreign key relation from flights to airports to the object? Column dest might work, since it also contains airport codes:

As you can see, behind the scenes, checks are executed automatically (unless check = FALSE) by the functions of dm to prevent steps that would result in inconsistent representations.

Use dm_has_fk() for checking if a foreign key exists that is pointing from one table to another:

If you want to access the name of the column which acts as a foreign key of one table to another table’s column, use dm_get_fk():

Remove foreign key relations with dm_rm_fk() (parameter column = NULL means that all relations will be removed):

Since the primary keys are defined in the dm object, you do not need to provide the referenced column name of ref_table. This is always the primary key column of the table.

Another function for getting to know your data better (cf. dm_enum_pk_candidates() in “Primary keys of dm objects”) is dm_enum_fk_candidates(). Use it to get an overview over foreign key candidates that point from one table to another:

dm_enum_fk_candidates(flights_dm_with_key, weather, airports)
#> # A tibble: 15 x 3
#>       columns candidate why                                                
#>    <list<chr> <lgl>     <chr>                                              
#>  1        [1] TRUE      ""                                                 
#>  2        [1] FALSE     "Can't join on 'value' x 'value' because of incomp…
#>  3        [1] FALSE     "Can't join on 'value' x 'value' because of incomp…
#>  4        [1] FALSE     "Can't join on 'value' x 'value' because of incomp…
#>  5        [1] FALSE     "Can't join on 'value' x 'value' because of incomp…
#>  6        [1] FALSE     "Can't join on 'value' x 'value' because of incomp…
#>  7        [1] FALSE     "Can't join on 'value' x 'value' because of incomp…
#>  8        [1] FALSE     "Can't join on 'value' x 'value' because of incomp…
#>  9        [1] FALSE     "Can't join on 'value' x 'value' because of incomp…
#> 10        [1] FALSE     "cannot join a POSIXct object with an object that …
#> 11        [1] FALSE     "Can't join on 'value' x 'value' because of incomp…
#> 12        [1] FALSE     "Can't join on 'value' x 'value' because of incomp…
#> 13        [1] FALSE     "Can't join on 'value' x 'value' because of incomp…
#> 14        [1] FALSE     "Can't join on 'value' x 'value' because of incomp…
#> 15        [1] FALSE     "Can't join on 'value' x 'value' because of incomp…

Get an overview of all foreign key relations withdm_get_all_fks():

#> # A tibble: 4 x 3
#>   child_table child_fk_cols parent_table
#>   <chr>         <list<chr>> <chr>       
#> 1 flights               [1] airlines    
#> 2 flights               [1] airports    
#> 3 flights               [1] airports    
#> 4 flights               [1] planes