This vignette deals with situations, when you want to transform tables of your dm and then update an existing table or add a new one to the dm. There are two straightforward solutions to the problem:

  1. individually access the tables relevant for the calculation, perform the necessary transformations, add the result to the dm (or replace an existing table) and establish the key relations.
  2. do all this within the dm object by zooming to a table and manipulate it while maintaining the key relations whenever possible.

The first approach is rather self-explanatory, so let us have a closer look at the second way.

Enabling {dplyr}-workflow within a dm

Some general information about “zooming” to a table of a dm: - all information stored in the original dm is kept, including the originally zoomed table - an object of class zoomed_dm is produced - a copy of the original table is available for transformations

{dm} provides methods for many of the {dplyr}-verbs for a zoomed_dm which behave the way you are used to, affecting only the zoomed table and leaving the rest of the dm untouched. When you are finished with transforming the table, there are three options to proceed:

  1. use dm_update_zoomed() if you want to replace the originally zoomed table with the new table
  2. with dm_insert_zoomed() you are creating a new table for your dm
  3. use dm_discard_zoomed() if you do not need the result and want to discard it

When employing one of the first two options, the resulting table in the dm will have all (primary as well as foreign) keys available, that could be tracked from the originally zoomed table.

Examples

So much to the theory, but how does it look and feel? To explore this, we once more make use of our trusted {nycflights13} data.

Use case 1: Add a new column to an existing table

Imagine you want to have a column in flights, specifying if a flight left before noon or after. Just like with {dplyr}, we can tackle this with mutate(). Let us do this step by step:

#> ── Table source ───────────────────────────────────────────────────────────
#> src:  <environment: R_GlobalEnv>
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `weather`
#> Columns: 53
#> Primary keys: 3
#> Foreign keys: 3
#> # Zoomed table: flights
#> # A tibble:     11,227 x 19
#>     year month   day dep_time sched_dep_time dep_delay arr_time
#>    <int> <int> <int>    <int>          <int>     <dbl>    <int>
#>  1  2013     1    10        3           2359         4      426
#>  2  2013     1    10       16           2359        17      447
#>  3  2013     1    10      450            500       -10      634
#>  4  2013     1    10      520            525        -5      813
#>  5  2013     1    10      530            530         0      824
#>  6  2013     1    10      531            540        -9      832
#>  7  2013     1    10      535            540        -5     1015
#>  8  2013     1    10      546            600       -14      645
#>  9  2013     1    10      549            600       -11      652
#> 10  2013     1    10      550            600       -10      649
#> # … with 11,217 more rows, and 12 more variables: sched_arr_time <int>,
#> #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, time_hour <dttm>
#> # Zoomed table: flights
#> # A tibble:     11,227 x 20
#>     year month   day dep_time am_pm_dep sched_dep_time dep_delay arr_time
#>    <int> <int> <int>    <int> <chr>              <int>     <dbl>    <int>
#>  1  2013     1    10        3 am                  2359         4      426
#>  2  2013     1    10       16 am                  2359        17      447
#>  3  2013     1    10      450 am                   500       -10      634
#>  4  2013     1    10      520 am                   525        -5      813
#>  5  2013     1    10      530 am                   530         0      824
#>  6  2013     1    10      531 am                   540        -9      832
#>  7  2013     1    10      535 am                   540        -5     1015
#>  8  2013     1    10      546 am                   600       -14      645
#>  9  2013     1    10      549 am                   600       -11      652
#> 10  2013     1    10      550 am                   600       -10      649
#> # … with 11,217 more rows, and 12 more variables: sched_arr_time <int>,
#> #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, time_hour <dttm>
#> ── Table source ───────────────────────────────────────────────────────────
#> src:  <environment: R_GlobalEnv>
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `weather`
#> Columns: 54
#> Primary keys: 3
#> Foreign keys: 3
%0 airlines airlinescarrierairports airportsfaaflights flightscarriertailnumoriginflights:carrier->airlines:carrier flights:origin->airports:faa planes planestailnumflights:tailnum->planes:tailnum weather weather

Use case 2: Creation of a surrogate key

Exactly such a course of action could for example be employed to create a surrogate key for a table. We can do this for the weather table.

#> # Zoomed table: weather
#> # A tibble:     26,115 x 15
#>    origin  year month   day  hour  temp  dewp humid wind_dir wind_speed
#>    <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>
#>  1 EWR     2013     1     1     1  39.0  26.1  59.4      270      10.4 
#>  2 EWR     2013     1     1     2  39.0  27.0  61.6      250       8.06
#>  3 EWR     2013     1     1     3  39.0  28.0  64.4      240      11.5 
#>  4 EWR     2013     1     1     4  39.9  28.0  62.2      250      12.7 
#>  5 EWR     2013     1     1     5  39.0  28.0  64.4      260      12.7 
#>  6 EWR     2013     1     1     6  37.9  28.0  67.2      240      11.5 
#>  7 EWR     2013     1     1     7  39.0  28.0  64.4      240      15.0 
#>  8 EWR     2013     1     1     8  39.9  28.0  62.2      250      10.4 
#>  9 EWR     2013     1     1     9  39.9  28.0  62.2      260      15.0 
#> 10 EWR     2013     1     1    10  41    28.0  59.6      260      13.8 
#> # … with 26,105 more rows, and 5 more variables: wind_gust <dbl>,
#> #   precip <dbl>, pressure <dbl>, visib <dbl>, time_hour <dttm>
#> # A tibble: 15 x 3
#>    columns    candidate why                                                
#>    <keys>     <lgl>     <chr>                                              
#>  1 day        FALSE     has duplicate values: 1, 2, 3, 4, 5, … (>= 7 total)
#>  2 dewp       FALSE     has duplicate values: -9.94, -9.04, -7.96, -7.06, …
#>  3 hour       FALSE     has duplicate values: 0, 1, 2, 3, 4, … (>= 7 total)
#>  4 humid      FALSE     has duplicate values: 15.59, 16.15, 16.68, 16.72, …
#>  5 month      FALSE     has duplicate values: 1, 2, 3, 4, 5, … (>= 7 total)
#>  6 origin     FALSE     has duplicate values: EWR, JFK, LGA                
#>  7 precip     FALSE     has duplicate values: 0.00, 0.01, 0.02, 0.03, 0.04…
#>  8 pressure   FALSE     has duplicate values: 993.9, 994.1, 994.2, 994.4, …
#>  9 temp       FALSE     has duplicate values: 10.94, 12.02, 12.92, 14.00, …
#> 10 time_hour  FALSE     has duplicate values: 2013-01-01 01:00:00, 2013-01…
#> 11 visib      FALSE     has duplicate values: 0.00, 0.06, 0.12, 0.25, 0.50…
#> 12 wind_dir   FALSE     has duplicate values: 0, 10, 20, 30, 40, … (>= 7 t…
#> 13 wind_gust  FALSE     has duplicate values: 16.11092, 17.26170, 18.41248…
#> 14 wind_speed FALSE     has duplicate values: 0.00000, 3.45234, 4.60312, 5…
#> 15 year       FALSE     has duplicate values: 2013
#> # A tibble: 1 x 3
#>   columns        candidate why  
#>   <keys>         <lgl>     <chr>
#> 1 origin_slot_id TRUE      ""
#> ── Table source ───────────────────────────────────────────────────────────
#> src:  <environment: R_GlobalEnv>
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `weather`
#> Columns: 53
#> Primary keys: 4
#> Foreign keys: 3
%0 airlines airlinescarrierairports airportsfaaflights flightscarriertailnumorigin_slot_idoriginflights:carrier->airlines:carrier flights:origin->airports:faa planes planestailnumflights:tailnum->planes:tailnum weather weatherorigin_slot_idflights:origin_slot_id->weather:origin_slot_id

Use case 3: Disentangle dm

If you look at the dm created by dm_nycflights13(cycle = TRUE), you see that two columns of flights relate to one and the same table, airports. One column stands for the departure airport and the other for the arrival airport.

dm_draw(dm_nycflights13(cycle = TRUE))
%0 airlines airlinescarrierairports airportsfaaflights flightscarriertailnumorigindestflights:carrier->airlines:carrier flights:origin->airports:faa flights:dest->airports:faa planes planestailnumflights:tailnum->planes:tailnum weather weather

In such cases it can be beneficial, to “disentangle” the dm by duplicating the referred table. One way to do this in the {dm}-framework is as follows:

%0 airlines airlinescarrierdestination destinationfaaflights flightscarriertailnumorigindestflights:carrier->airlines:carrier flights:dest->destination:faa origin originfaaflights:origin->origin:faa planes planestailnumflights:tailnum->planes:tailnum weather weather

In a future update we will provide a more convenient way to “disentangle” dm objects, so that the individual steps will be done automatically.

Use case 4: Add summary table to dm

Here is an example for adding a summary of a table as a new table to a dm (FK-relations are taken care of automatically):

dm_with_summary <- 
  flights_dm %>% 
  dm_zoom_to(flights) %>% 
  count(origin, carrier) %>% 
  dm_insert_zoomed("dep_carrier_count")
dm_draw(dm_with_summary)
%0 airlines airlinescarrierairports airportsfaadep_carrier_count dep_carrier_countorigincarrierdep_carrier_count:carrier->airlines:carrier dep_carrier_count:origin->airports:faa flights flightscarriertailnumoriginflights:carrier->airlines:carrier flights:origin->airports:faa planes planestailnumflights:tailnum->planes:tailnum weather weather

Use case 5: Joining tables

If you would like to join some or all of the columns of one table to another, you can make use of one of the join-methods for a zoomed_dm. In addition to the usual arguments for the {dplyr}-joins, you can specify in an argument select, which columns of the RHS-table you want to be included in the join. For the syntax please see the example below. The LHS-table of a join is always the zoomed table.

#> # A tibble: 11,227 x 11
#>     year month   day dep_time sched_dep_time carrier flight tailnum origin
#>    <int> <int> <int>    <int>          <int> <chr>    <int> <chr>   <chr> 
#>  1  2013     1    10        3           2359 B6         727 N571JB  JFK   
#>  2  2013     1    10       16           2359 B6         739 N564JB  JFK   
#>  3  2013     1    10      450            500 US        1117 N171US  EWR   
#>  4  2013     1    10      520            525 UA        1018 N35204  EWR   
#>  5  2013     1    10      530            530 UA         404 N815UA  LGA   
#>  6  2013     1    10      531            540 AA        1141 N5EAAA  JFK   
#>  7  2013     1    10      535            540 B6         725 N784JB  JFK   
#>  8  2013     1    10      546            600 B6         380 N337JB  EWR   
#>  9  2013     1    10      549            600 EV        6055 N19554  LGA   
#> 10  2013     1    10      550            600 US        2114 N740UW  LGA   
#> # … with 11,217 more rows, and 2 more variables: dest <chr>,
#> #   plane_type <chr>
%0 airlines airlinescarrierairports airportsfaaflights flightscarriertailnumoriginflights:carrier->airlines:carrier flights:origin->airports:faa planes planestailnumflights:tailnum->planes:tailnum flights_plane_type flights_plane_typecarriertailnumoriginflights_plane_type:carrier->airlines:carrier flights_plane_type:origin->airports:faa flights_plane_type:tailnum->planes:tailnum weather weather

Tip: Accessing the zoomed table

At each point you can retrieve the zoomed table by calling pull_tbl() on a zoomed_dm. To use our last example once more:

flights_dm %>% 
  dm_zoom_to(flights) %>% 
  select(-dep_delay:-arr_delay, -air_time:-time_hour) %>% 
  left_join(planes, select = c(tailnum, plane_type = type)) %>% 
  pull_tbl()
#> # A tibble: 11,227 x 11
#>     year month   day dep_time sched_dep_time carrier flight tailnum origin
#>    <int> <int> <int>    <int>          <int> <chr>    <int> <chr>   <chr> 
#>  1  2013     1    10        3           2359 B6         727 N571JB  JFK   
#>  2  2013     1    10       16           2359 B6         739 N564JB  JFK   
#>  3  2013     1    10      450            500 US        1117 N171US  EWR   
#>  4  2013     1    10      520            525 UA        1018 N35204  EWR   
#>  5  2013     1    10      530            530 UA         404 N815UA  LGA   
#>  6  2013     1    10      531            540 AA        1141 N5EAAA  JFK   
#>  7  2013     1    10      535            540 B6         725 N784JB  JFK   
#>  8  2013     1    10      546            600 B6         380 N337JB  EWR   
#>  9  2013     1    10      549            600 EV        6055 N19554  LGA   
#> 10  2013     1    10      550            600 US        2114 N740UW  LGA   
#> # … with 11,217 more rows, and 2 more variables: dest <chr>,
#> #   plane_type <chr>

Possible pitfalls and caveats

  1. Currently not all of the {dplyr}-verbs have their own method for a zoomed_dm, so be aware that in some cases it will still be necessary to resort to extracting one or more tables from a dm and reinserting a transformed version of theirs into the dm eventually. The supported functions are: group_by(), ungroup(), summarise(), mutate(), transmute(), filter(), select(), rename(), distinct(), arrange(), slice(), left_join(), inner_join(), full_join(), right_join(), semi_join() and anti_join().

  2. The same is true for {tidyr}-functions. Methods are provided for: unite() and separate().

  3. There might be situations when you would like the key relations to remain intact, but they are dropped nevertheless. This is because a rigid logic is implemented, that does drop a key when its associated column is acted upon with e.g. a mutate() call. In these cases the key relations will need to be established once more after finishing with the manipulations.