Multiple, linked tables are common within computer science. Because many R users have backgrounds in other disciplines, we present six important terms in relational data modeling to help you to jump-start working with {dm}. These terms are:

  1. Data Frames and Tables
  2. Data Model
  3. Primary Keys
  4. Foreign Keys
  5. Normalization
  6. Relational Databases

1. Data Frames and Tables

A data frame is a fundamental data structure in R. Columns represent variables, rows represent observations. In more technical terms: a data frame is a list of variables of identical length and unique row names. If you imagine it visually, the result is a typical table structure. That is why working with data from spreadsheets is so convenient and the users of the the popular {dplyr} package for data wrangling mainly rely on data frames.

The downside is that data frames and flat file systems like spreadsheets can result in bloated tables because they hold many repetitive values. In the worst case, a data frame can contain multiple columns with only a single value different in each row.

This calls for better data organization by utilizing the resemblance between data frames and database tables, which also consist of columns and rows. The elements are just called differently:

Data Frame Table
Column Attribute
Row Tuple

The separation of data into multiple tables helps to improve data quality but in order to take full advantage of the approach, an associated data model is needed to overcome the difficulties that arise with joining multiple tables.

Let’s illustrate this challenge with the data from the nycflights13 dataset that contains detailed information about the 336776 flights that departed from New York City in 2013. The information is stored in five tables.

Details like the full name of an airport are not available immediately; these can only be obtained by joining or merging the constituent tables, which can result in long and inflated pipe chains full of left_join(), anti_join() and other forms of data merging.

In classical {dplyr} notation, you will need three left_join() calls to merge the flights table gradually to airlines, planes and airports tables to create one wide data frame.

library(tidyverse)
library(dm)
library(nycflights13)

flights %>% 
  left_join(airlines, by = "carrier") %>%
  left_join(planes, by = "tailnum") %>%
  left_join(airports, by = c("origin" = "faa"))
#> # A tibble: 336,776 x 35
#>    year.x month   day dep_time sched_dep_time dep_delay arr_time
#>     <int> <int> <int>    <int>          <int>     <dbl>    <int>
#>  1   2013     1     1      517            515         2      830
#>  2   2013     1     1      533            529         4      850
#>  3   2013     1     1      542            540         2      923
#>  4   2013     1     1      544            545        -1     1004
#>  5   2013     1     1      554            600        -6      812
#>  6   2013     1     1      554            558        -4      740
#>  7   2013     1     1      555            600        -5      913
#>  8   2013     1     1      557            600        -3      709
#>  9   2013     1     1      557            600        -3      838
#> 10   2013     1     1      558            600        -2      753
#> # … with 336,766 more rows, and 28 more variables: sched_arr_time <int>,
#> #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, time_hour <dttm>, name.x <chr>, year.y <int>,
#> #   type <chr>, manufacturer <chr>, model <chr>, engines <int>,
#> #   seats <int>, speed <int>, engine <chr>, name.y <chr>, lat <dbl>,
#> #   lon <dbl>, alt <dbl>, tz <dbl>, dst <chr>, tzone <chr>

{dm} offers a more elegant and shorter way to combine tables while augmenting {dplyr}/{dbplyr} workflows.

It is possible to have the best of both worlds: manage your data with {dm} as linked tables, then flatten multiple tables into one for your analysis with {dplyr} on an as-needed basis.

The next step is to create a data model based on multiple tables:

2. Data Model

A data model shows the structure between multiple tables that can be linked together.

The nycflights13 relations can be transferred into the following graphical representation:

%0 airlines airlinescarrierairports airportsfaaflights flightscarriertailnumorigindestflights:carrier->airlines:carrier flights:origin->airports:faa flights:dest->airports:faa planes planestailnumflights:tailnum->planes:tailnum weather weather

The flights table is linked to three other tables: airlines, planes and airports. By using directed arrows, the visualization shows explicitly the connection between different columns (they are called attributes in the relational data sphere).

For example: The column carrier in flights can be joined with the column carrier from the airlines table.

The links between the tables are established through primary keys and foreign keys.

Further Reading: The {dm} methods for visualizing data models.

3. Primary Keys

In a relational data model, every table needs to have one column or attribute that uniquely identifies a row. This column is called the primary key (abbreviated with pk). A primary key can be either an existing column that satifies the condition of being unique, or a new column that assigns an identifier.

Example: In the airlines table of nycflights13 the column carrier is the primary key.

You can get all primary keys in a dm by calling dm_get_all_pks():

#> # A tibble: 3 x 2
#>   table         pk_col
#>   <chr>    <list<chr>>
#> 1 airlines         [1]
#> 2 airports         [1]
#> 3 planes           [1]

If an attribute is suitable as a primary key, it can be checked with dm_enum_pk_candidates(). Which columns of the airlines table can serve as a primary key?

#> # A tibble: 8 x 3
#>      columns candidate why                                                 
#>   <list<chr> <lgl>     <chr>                                               
#> 1        [1] TRUE      ""                                                  
#> 2        [1] TRUE      ""                                                  
#> 3        [1] FALSE     "has duplicate values: 0, 1, 3, 4, 5, … (7 total)"  
#> 4        [1] FALSE     "has duplicate values: A, N, U"                     
#> 5        [1] FALSE     "has duplicate values: 38.88944, 40.63975"          
#> 6        [1] FALSE     "has duplicate values: All Airports, Capital City A…
#> 7        [1] FALSE     "has duplicate values: -10, -9, -8, -7, -6, … (7 to…
#> 8        [1] FALSE     "has duplicate values: America/Anchorage, America/C…

Further Reading: The {dm} package offers several function for dealing with primary keys.

4. Foreign Keys

The counterpart of a primary key in one table is the foreign key in another table. In order to join two tables, the primary key of the first table needs to be available in the second table as well. This second column is called the foreign key (abbreviated with fk).

For example, if you want to link the airlines table to the flights table, the primary key in airlines needs to match the foreign key in flights. This condition is satisfied because the column carrier is present as a primary key in the airlines table as well as a foreign key in the flights table. You can find foreign key candidates with the function dm_enum_fk_candidates(), they are marked with TRUE in the candidate column.

#> # A tibble: 19 x 3
#>       columns candidate why                                                
#>    <list<chr> <lgl>     <chr>                                              
#>  1        [1] TRUE      ""                                                 
#>  2        [1] FALSE     "334264 entries (99.3%) of `flights$tailnum` not i…
#>  3        [1] FALSE     "336776 entries (100%) of `flights$dest` not in `a…
#>  4        [1] FALSE     "336776 entries (100%) of `flights$origin` not in …
#>  5        [1] FALSE     "Can't join on 'value' x 'value' because of incomp…
#>  6        [1] FALSE     "Can't join on 'value' x 'value' because of incomp…
#>  7        [1] FALSE     "Can't join on 'value' x 'value' because of incomp…
#>  8        [1] FALSE     "Can't join on 'value' x 'value' because of incomp…
#>  9        [1] FALSE     "Can't join on 'value' x 'value' because of incomp…
#> 10        [1] FALSE     "Can't join on 'value' x 'value' because of incomp…
#> 11        [1] FALSE     "Can't join on 'value' x 'value' because of incomp…
#> 12        [1] FALSE     "Can't join on 'value' x 'value' because of incomp…
#> 13        [1] FALSE     "Can't join on 'value' x 'value' because of incomp…
#> 14        [1] FALSE     "Can't join on 'value' x 'value' because of incomp…
#> 15        [1] FALSE     "Can't join on 'value' x 'value' because of incomp…
#> 16        [1] FALSE     "Can't join on 'value' x 'value' because of incomp…
#> 17        [1] FALSE     "Can't join on 'value' x 'value' because of incomp…
#> 18        [1] FALSE     "cannot join a POSIXct object with an object that …
#> 19        [1] FALSE     "Can't join on 'value' x 'value' because of incomp…

After finding and assigning foreign keys, get the name of the set foreign key:

Further Reading: All {dm} functions for working with foreign keys.

5. Normalization

Normalization is a technical term that describes the central design principle of a relational data model: splitting data into multiple tables.

A normalized data schema consists of several relations (tables) that are linked with attributes (columns). The relations can be joined together by means of primary and foreign keys. The main goal of normalization is to keep data organization as clean and simple as possible by avoiding redundant data entries.

For example, if you want to change the name of one airport in the nycflights13 dataset, you will only need to update a single data value. This principle is sometimes called the single point of truth.

#> # A tibble: 1 x 2
#>   carrier name                  
#>   <chr>   <chr>                 
#> 1 UA      United broke my guitar
#> # A tibble: 336,776 x 2
#>    flight name                    
#>     <int> <chr>                   
#>  1   1545 United broke my guitar  
#>  2   1714 United broke my guitar  
#>  3   1141 American Airlines Inc.  
#>  4    725 JetBlue Airways         
#>  5    461 Delta Air Lines Inc.    
#>  6   1696 United broke my guitar  
#>  7    507 JetBlue Airways         
#>  8   5708 ExpressJet Airlines Inc.
#>  9     79 JetBlue Airways         
#> 10    301 American Airlines Inc.  
#> # … with 336,766 more rows

Another way to demonstrate normalization is splitting a table in two parts.

Let’s look at the planes table, which consists of 3322 individual tail numbers and corresponding information for the specific airplane, like the year it was manufactured or the average cruising speed.

The function decompose_table() extracts two new tables and creates a new key model_id, that links both tables.

This results in a parent_table and a child_table that differ massively in the number of rows:

planes %>%
  decompose_table(model_id, model, manufacturer, type, engines, seats, manufacturer, speed)
#> $child_table
#> # A tibble: 3,322 x 4
#>    tailnum  year engine    model_id
#>    <chr>   <int> <chr>        <int>
#>  1 N10156   2004 Turbo-fan      120
#>  2 N102UW   1998 Turbo-fan       94
#>  3 N103US   1999 Turbo-fan       94
#>  4 N104UW   1999 Turbo-fan       94
#>  5 N10575   2002 Turbo-fan      119
#>  6 N105UW   1999 Turbo-fan       94
#>  7 N107US   1999 Turbo-fan       94
#>  8 N108UW   1999 Turbo-fan       94
#>  9 N109UW   1999 Turbo-fan       94
#> 10 N110UW   1999 Turbo-fan       94
#> # … with 3,312 more rows
#> 
#> $parent_table
#> # A tibble: 147 x 7
#>    model_id model     manufacturer type                 engines seats speed
#>       <int> <chr>     <chr>        <chr>                  <int> <int> <int>
#>  1        1 150       CESSNA       Fixed wing single e…       1     2    90
#>  2        2 172E      CESSNA       Fixed wing single e…       1     4   105
#>  3        3 172M      CESSNA       Fixed wing single e…       1     4   108
#>  4        4 172N      CESSNA       Fixed wing single e…       1     4   105
#>  5        5 206B      BELL         Rotorcraft                 1     5   112
#>  6        6 210-5(20… CESSNA       Fixed wing single e…       1     6    NA
#>  7        7 230       BELL         Rotorcraft                 2    11    NA
#>  8        8 310Q      CESSNA       Fixed wing multi en…       2     6   167
#>  9        9 421C      CESSNA       Fixed wing multi en…       2     8    90
#> 10       10 550       CESSNA       Fixed wing multi en…       2     8    NA
#> # … with 137 more rows

While child_table contains 3322 unique tailnum rows and therefore consists of 3322 rows, just like the original planes table, the parent_table shrunk to just 147 rows, enough to store all relevant combinations and avoid storing redundant information.

Further Reading: See the Simple English Wikipedia article on database normalisation for more details.

6. Relational Databases

{dm} is built upon relational data models but it is not a database itself. Databases are systems for data management and many of them are constructed as relational databases, e.g. SQLite, MySQL, MSSQL, Postgres. As you can guess from the names of the databases, SQL, short for structured querying language, plays an important role: it was invented for the purpose of querying relational databases.

In production, the data is stored in a relational database and {dm} is used to work with the data.

Therefore, {dm} can copy data from and to databases, and works transparently with both in-memory data and with relational database systems.

For example, let’s create a local Sqlite database and copy the dm object to it:

In the opposite direction, dm can also be populated with data from a DB. Unfortunately, this currently works only for MSSQL and Postgres, but not for Sqlite. Therefore, we cannot show the results here just yet:

dm_learn_from_db(src_sqlite)