The goal of the {dm} package and the dm
class that comes with it, is to make your life easier when you are dealing with data from several different tables.
Let’s take a look at the dm
class.
dm
The dm
class consists of a collection of tables and metadata about the tables, such as
All tables in a dm
must be obtained from the same data source; csv files and spreadsheets would need to be imported to data frames in R.
dm
objectsThere are currently three options available for creating a dm
object. The relevant functions for creating dm
objects are:
To illustrate these options, we will now create the same dm
in several different ways. We can use the tables from the well-known {nycflights13} package.
Create a dm
object directly by providing data frames to dm()
:
library(nycflights13)
library(dm)
dm(airlines, airports, flights, planes, weather)
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `weather`
#> Columns: 53
#> Primary keys: 0
#> Foreign keys: 0
dm
Start with an empty dm
object that has been created with dm()
or new_dm()
, and add tables to that object:
library(nycflights13)
library(dm)
empty_dm <- dm()
empty_dm
#> dm()
dm_add_tbl(empty_dm, airlines, airports, flights, planes, weather)
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `weather`
#> Columns: 53
#> Primary keys: 0
#> Foreign keys: 0
Turn a named list of tables into a dm
with as_dm()
:
as_dm(list(airlines = airlines,
airports = airports,
flights = flights,
planes = planes,
weather = weather))
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `weather`
#> Columns: 53
#> Primary keys: 0
#> Foreign keys: 0
src
into a dm
Squeeze all (or a subset of) tables belonging to a src
object into a dm
using dm_from_src()
:
sqlite_src <- dbplyr::nycflights13_sqlite()
flights_dm <- dm_from_src(sqlite_src)
flights_dm
#> ── Table source ───────────────────────────────────────────────────────────
#> src: sqlite 3.34.1 [/tmp/RtmpD87ibw/nycflights13.sqlite]
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `sqlite_stat1`, … (7 total)
#> Columns: 62
#> Primary keys: 0
#> Foreign keys: 0
The function dm_from_src(src, table_names = NULL)
includes all available tables on a source in the dm
object. This means that you can use this, for example, on a postgres database that you access via src_postgres()
(with the appropriate arguments dbname
, host
, port
, …), to produce a dm
object with all the tables on the database.
Another way of creating a dm
object is calling new_dm()
on a list of tbl
objects:
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `trees`, `mtcars`
#> Columns: 14
#> Primary keys: 0
#> Foreign keys: 0
This constructor is optimized for speed and does not perform integrity checks. Use with caution, validate using validate_dm()
if necessary.
validate_dm(base_dm)
We can get the list of tables with dm_get_tables()
and the src
object with dm_get_src()
.
In order to pull a specific table from a dm
, use:
tbl(flights_dm, "airports")
#> # Source: table<`airports`> [?? x 8]
#> # Database: sqlite 3.34.1 [/tmp/RtmpD87ibw/nycflights13.sqlite]
#> faa name lat lon alt tz dst tzone
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
#> 1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A America/New_…
#> 2 06A Moton Field Municipa… 32.5 -85.7 264 -6 A America/Chic…
#> 3 06C Schaumburg Regional 42.0 -88.1 801 -6 A America/Chic…
#> 4 06N Randall Airport 41.4 -74.4 523 -5 A America/New_…
#> 5 09J Jekyll Island Airport 31.1 -81.4 11 -5 A America/New_…
#> 6 0A9 Elizabethton Municip… 36.4 -82.2 1593 -5 A America/New_…
#> 7 0G6 Williams County Airp… 41.5 -84.5 730 -5 A America/New_…
#> 8 0G7 Finger Lakes Regiona… 42.9 -76.8 492 -5 A America/New_…
#> 9 0P2 Shoestring Aviation … 39.8 -76.6 1000 -5 U America/New_…
#> 10 0S9 Jefferson County Intl 48.1 -123. 108 -8 A America/Los_…
#> # … with more rows
But how can we use {dm}-functions to manage the primary keys of the tables in a dm
object?
dm
objectsSome useful functions for managing primary key settings are:
Currently dm
objects only support one-column primary keys. If your tables have unique compound keys, adding a surrogate key column might be helpful. If you created a dm
object according to the examples in “Examples of dm
objects”, your object does not yet have any primary keys set. So let’s add one.
dm_add_pk()
has an option to check if the column of the table given by the user is a unique key; for performance reasons, the check will not be executed unless requested. We use the nycflights13
tables, i.e. flights_dm
from above.
dm_has_pk(flights_dm, airports)
#> [1] FALSE
flights_dm_with_key <- dm_add_pk(flights_dm, airports, faa)
flights_dm_with_key
#> ── Table source ───────────────────────────────────────────────────────────
#> src: sqlite 3.34.1 [/tmp/RtmpD87ibw/nycflights13.sqlite]
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `sqlite_stat1`, … (7 total)
#> Columns: 62
#> Primary keys: 1
#> Foreign keys: 0
The dm
now has a primary key. Let’s check:
dm_has_pk(flights_dm_with_key, airports)
#> [1] TRUE
Get the name of the column that is marked as primary key of the table:
dm_get_pk(flights_dm_with_key, airports)
#> <list_of<character>[1]>
#> [[1]]
#> [1] "faa"
Remove a primary key:
If you still need to get to know your data better, and it is already available in the form of a dm
object, you can use the dm_enum_pk_candidates()
function in order to get information about which columns of the table are unique keys:
dm_enum_pk_candidates(flights_dm_with_key, airports)
#> # A tibble: 8 x 3
#> columns candidate why
#> <keys> <lgl> <chr>
#> 1 faa TRUE ""
#> 2 lon TRUE ""
#> 3 alt FALSE "has duplicate values: 0, 1, 3, 4, 5, …"
#> 4 dst FALSE "has duplicate values: A, N, U"
#> 5 lat FALSE "has duplicate values: 38.88944, 40.63975"
#> 6 name FALSE "has duplicate values: All Airports, Capital City Airp…
#> 7 tz FALSE "has duplicate values: -10, -9, -8, -7, -6, …"
#> 8 tzone FALSE "has missing values, and duplicate values: America/Anc…
The flights
table does not have any one-column primary key candidates:
dm_enum_pk_candidates(flights_dm_with_key, flights) %>% dplyr::count(candidate)
#> # A tibble: 1 x 2
#> candidate n
#> * <lgl> <int>
#> 1 FALSE 19
To get an overview over all tables with primary keys, use dm_get_all_pks()
:
dm_get_all_pks(dm_nycflights13(cycle = TRUE))
#> # A tibble: 3 x 2
#> table pk_col
#> <chr> <keys>
#> 1 airlines carrier
#> 2 airports faa
#> 3 planes tailnum
Here we used the prepared dm
object dm_nycflights13(cycle = TRUE)
as an example. This object already has all keys pre-set.
Useful functions for managing foreign key relations include:
Now it gets (even more) interesting: we want to define relations between different tables. With the dm_add_fk()
function you can define which column of which table points to another table’s column.
This is done by choosing a foreign key from one table that will point to a primary key of another table. The primary key of the referred table must be set with dm_add_pk()
. dm_add_fk()
will find the primary key column of the referenced table by itself and make the indicated column of the child table point to it.
flights_dm_with_key %>% dm_add_fk(flights, origin, airports)
#> ── Table source ───────────────────────────────────────────────────────────
#> src: sqlite 3.34.1 [/tmp/RtmpD87ibw/nycflights13.sqlite]
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `sqlite_stat1`, … (7 total)
#> Columns: 62
#> Primary keys: 1
#> Foreign keys: 1
This will throw an error:
flights_dm %>% dm_add_fk(flights, origin, airports)
#> Error: ref_table `airports` needs a primary key first. Use `dm_enum_pk_candidates()` to find appropriate columns and `dm_add_pk()` to define a primary key.
Let’s create a dm
object with a foreign key relation to work with later on:
flights_dm_with_fk <- dm_add_fk(flights_dm_with_key, flights, origin, airports)
What if we tried to add another foreign key relation from flights
to airports
to the object? Column dest
might work, since it also contains airport codes:
flights_dm_with_fk %>% dm_add_fk(flights, dest, airports, check = TRUE)
#> Error: Column `dest` of table `flights` contains values (see examples above) that are not present in column `faa` of table `airports`.
As you can see, behind the scenes, checks are executed automatically (unless check = FALSE
) by the functions of dm
to prevent steps that would result in inconsistent representations.
Use dm_has_fk()
for checking if a foreign key exists that is pointing from one table to another:
flights_dm_with_fk %>% dm_has_fk(flights, planes)
#> [1] FALSE
flights_dm_with_fk %>% dm_has_fk(flights, airports)
#> [1] TRUE
If you want to access the name of the column which acts as a foreign key of one table to another table’s column, use dm_get_fk()
:
flights_dm_with_fk %>% dm_get_fk(flights, planes)
#> <list_of<character>[0]>
flights_dm_with_fk %>% dm_get_fk(flights, airports)
#> <list_of<character>[1]>
#> [[1]]
#> [1] "origin"
Remove foreign key relations with dm_rm_fk()
(parameter column = NULL
means that all relations will be removed):
flights_dm_with_fk %>%
dm_rm_fk(table = flights, column = dest, ref_table = airports) %>%
dm_get_fk(flights, airports)
#> Error: (`dest`) is not a foreign key of table `flights` into table `airports`.
flights_dm_with_fk %>%
dm_rm_fk(flights, origin, airports) %>%
dm_get_fk(flights, airports)
#> <list_of<character>[0]>
flights_dm_with_fk %>%
dm_rm_fk(flights, NULL, airports) %>%
dm_get_fk(flights, airports)
#> <list_of<character>[0]>
Since the primary keys are defined in the dm
object, you do not need to provide the referenced column name of ref_table
. This is always the primary key column of the table.
Another function for getting to know your data better (cf. dm_enum_pk_candidates()
in “Primary keys of dm
objects”) is dm_enum_fk_candidates()
. Use it to get an overview over foreign key candidates that point from one table to another:
dm_enum_fk_candidates(flights_dm_with_key, weather, airports)
#> # A tibble: 15 x 3
#> columns candidate why
#> <keys> <lgl> <chr>
#> 1 origin TRUE ""
#> 2 wind_gust FALSE "5337 entries (20.4%) of `weather$wind_gust` not i…
#> 3 pressure FALSE "23386 entries (89.6%) of `weather$pressure` not i…
#> 4 wind_dir FALSE "25655 entries (98.2%) of `weather$wind_dir` not i…
#> 5 wind_speed FALSE "26111 entries (100%) of `weather$wind_speed` not …
#> 6 dewp FALSE "26114 entries (100%) of `weather$dewp` not in `ai…
#> 7 humid FALSE "26114 entries (100%) of `weather$humid` not in `a…
#> 8 temp FALSE "26114 entries (100%) of `weather$temp` not in `ai…
#> 9 day FALSE "26115 entries (100%) of `weather$day` not in `air…
#> 10 hour FALSE "26115 entries (100%) of `weather$hour` not in `ai…
#> 11 month FALSE "26115 entries (100%) of `weather$month` not in `a…
#> 12 precip FALSE "26115 entries (100%) of `weather$precip` not in `…
#> 13 time_hour FALSE "26115 entries (100%) of `weather$time_hour` not i…
#> 14 visib FALSE "26115 entries (100%) of `weather$visib` not in `a…
#> 15 year FALSE "26115 entries (100%) of `weather$year` not in `ai…
Get an overview of all foreign key relations withdm_get_all_fks()
:
dm_get_all_fks(dm_nycflights13(cycle = TRUE))
#> # A tibble: 4 x 3
#> child_table child_fk_cols parent_table
#> <chr> <keys> <chr>
#> 1 flights carrier airlines
#> 2 flights dest airports
#> 3 flights origin airports
#> 4 flights tailnum planes