This document describes how you can get your data into a dm object.

Although the example that we will be using is included in dm, and you can see it by running dm_nycflights13(), we will be going through the built-in example one more time here, step by step.

The five tables that we are working with contain information about all flights that departed from the airports of New York to other destinations in the United States in 2013, and are available through the nycflights13 package:

  • flights represents the trips taken by planes
  • airlines includes
    • the names of transport organizations (name)
    • their abbreviated codes (carrier)
  • airports indicates the ports of departure (origin) and of destination (dest)
  • weather contains meterological information at each hour
  • planes describes characteristics of aircrafts

Once we’ve loaded nycflights13, the aforementioned tables are all in our work environment, ready to be accessed.

library(dm)
library(nycflights13)

Adding Tables

First, we will tell dm which tables we want to work with and how they are connected. For that, we can use dm or as_dm(), which accepts lists of objects. You must explicitly name the objects if you use list() (e.g., list("airlines" = airlines, "flights" = flights)) — here we use tibble::lst(), which automatically names the components of the list.

flights_dm_no_keys <- tibble::lst(airlines, airports, flights, planes, weather) %>%
  as_dm()

Defining Keys

Even though you now have a new dm object that contains all your data, some key details are still missing that determine how your five tables are connected (the foreign keys), and which column(s) uniquely identify the observations (the primary keys).

Primary Keys

dm offers dm_enum_pk_candidates() to identify primary keys and dm_add_pk() to add them.

## # A tibble: 9 x 3
##      columns candidate why                                                      
##   <list<chr> <lgl>     <chr>                                                    
## 1        [1] TRUE      ""                                                       
## 2        [1] FALSE     "has duplicate values: 4 Cycle, Reciprocating, Turbo-fan…
## 3        [1] FALSE     "has duplicate values: 1, 2, 3, 4"                       
## 4        [1] FALSE     "has duplicate values: AIRBUS, AIRBUS INDUSTRIE, AMERICA…
## 5        [1] FALSE     "has duplicate values: 717-200, 737-301, 737-3G7, 737-3H…
## 6        [1] FALSE     "has duplicate values: 2, 4, 5, 6, 7, … (7 total)"       
## 7        [1] FALSE     "has duplicate values: 90, 105, 162, 432, NA"            
## 8        [1] FALSE     "has duplicate values: Fixed wing multi engine, Fixed wi…
## 9        [1] FALSE     "has duplicate values: 1959, 1963, 1975, 1976, 1977, … (…

Now, add the primary keys that you have identified:

flights_dm_only_pks <- flights_dm_no_keys %>%
  dm_add_pk(table = airlines, column = carrier) %>%
  dm_add_pk(airports, faa) %>%
  dm_add_pk(planes, tailnum)
flights_dm_only_pks
## ── Table source ────────────────────────────────────────────────────────────────
## src:  <environment: R_GlobalEnv>
## ── Metadata ────────────────────────────────────────────────────────────────────
## Tables: `airlines`, `airports`, `flights`, `planes`, `weather`
## Columns: 53
## Primary keys: 3
## Foreign keys: 0

To review the primary keys after setting them, call dm_get_all_pks().

dm_get_all_pks(flights_dm_only_pks)
## # A tibble: 3 x 2
##   table         pk_col
##   <chr>    <list<chr>>
## 1 airlines         [1]
## 2 airports         [1]
## 3 planes           [1]

Foreign Keys

## # A tibble: 19 x 3
##       columns candidate why                                                     
##    <list<chr> <lgl>     <chr>                                                   
##  1        [1] TRUE      ""                                                      
##  2        [1] FALSE     "334264 entries (99.3%) of `flights$tailnum` not in `ai…
##  3        [1] FALSE     "336776 entries (100%) of `flights$dest` not in `airlin…
##  4        [1] FALSE     "336776 entries (100%) of `flights$origin` not in `airl…
##  5        [1] FALSE     "Can't join on 'value' x 'value' because of incompatibl…
##  6        [1] FALSE     "Can't join on 'value' x 'value' because of incompatibl…
##  7        [1] FALSE     "Can't join on 'value' x 'value' because of incompatibl…
##  8        [1] FALSE     "Can't join on 'value' x 'value' because of incompatibl…
##  9        [1] FALSE     "Can't join on 'value' x 'value' because of incompatibl…
## 10        [1] FALSE     "Can't join on 'value' x 'value' because of incompatibl…
## 11        [1] FALSE     "Can't join on 'value' x 'value' because of incompatibl…
## 12        [1] FALSE     "Can't join on 'value' x 'value' because of incompatibl…
## 13        [1] FALSE     "Can't join on 'value' x 'value' because of incompatibl…
## 14        [1] FALSE     "Can't join on 'value' x 'value' because of incompatibl…
## 15        [1] FALSE     "Can't join on 'value' x 'value' because of incompatibl…
## 16        [1] FALSE     "Can't join on 'value' x 'value' because of incompatibl…
## 17        [1] FALSE     "Can't join on 'value' x 'value' because of incompatibl…
## 18        [1] FALSE     "cannot join a POSIXct object with an object that is no…
## 19        [1] FALSE     "Can't join on 'value' x 'value' because of incompatibl…

To define how your tables are related, use dm_add_fk() to add foreign keys. First, define the tables that you wish to connect by parameterizing the dm_add_fk() function with table and ref_table options.

Then indicate in column which column of table refers to ref_table’s primary key, which you’ve defined above. Use check = FALSE to omit consistency checks.

flights_dm_all_keys <- flights_dm_only_pks %>%
  dm_add_fk(table = flights, column = tailnum, ref_table = planes, check = FALSE) %>%
  dm_add_fk(flights, carrier, airlines) %>%
  dm_add_fk(flights, origin, airports)
flights_dm_all_keys
## ── Table source ────────────────────────────────────────────────────────────────
## src:  <environment: R_GlobalEnv>
## ── Metadata ────────────────────────────────────────────────────────────────────
## Tables: `airlines`, `airports`, `flights`, `planes`, `weather`
## Columns: 53
## Primary keys: 3
## Foreign keys: 3

Retrieving Keys

To retrieve your keys later on, use dm_get_all_fks(), or dm_get_fk() for its singular version.

dm_get_all_pks(flights_dm_all_keys)
## # A tibble: 3 x 2
##   table         pk_col
##   <chr>    <list<chr>>
## 1 airlines         [1]
## 2 airports         [1]
## 3 planes           [1]

Voilà, here’s your dm object that you can work with:

## ── Table source ────────────────────────────────────────────────────────────────
## src:  <environment: R_GlobalEnv>
## ── Metadata ────────────────────────────────────────────────────────────────────
## Tables: `airlines`, `airports`, `flights`, `planes`, `weather`
## Columns: 53
## Primary keys: 3
## Foreign keys: 3