Here we want to present some specialised functions for conducting some basic tests regarding key conditions and generally the relations between tables. Additionally we will show some functions for splitting and uniting tables.

Testing key constraints

This section contains information and examples for the functions:

  1. check_key(.data, ...)
  2. check_if_subset(t1, c1, t2, c2)
  3. check_set_equality(t1, c1, t2, c2)

When you have tables (data frames) that are connected by key relations, this package might help you verify the assumed key relations and/or determine existing key relations between the tables. For example, if you have tables:

data_1 <- tibble(a = c(1, 2, 1), b = c(1, 4, 1), c = c(5, 6, 7))
data_2 <- tibble(a = c(1, 2, 3), b = c(4, 5, 6), c = c(7, 8, 9))

and you want to know if a is a primary key for data_1, you can use the check_key() function:

Mind the error message when a test is not passed.

For data_2, column a is a key:

check_key(data_2, a)

To see if a column of one table contains only values that are also present in another column of another table, you can use check_if_subset():

check_if_subset(data_1, a, data_2, a)

This is important to understand if this column is a foreign key to the other table. What about the inverse relation?

One should keep in mind, that check_if_subset() does not test if column c2 is a unique key of table t2. In order to find out if a (child) table t1 contains a column c1 that is a foreign key to a (parent) table t2 with the corresponding column c2, one would use the following approach:

check_key(t2, c2)
check_if_subset(t1, c1, t2, c2)

To check both directions at once - basically answering the questions: are the unique values of c_1 in t_1 the same as those of c_2 in t_2? - {dm} provides the function check_set_equality() (this would fail with a longer error message and is therefore not evaluated):

check_set_equality(data_1, a, data_2, a)

Bringing one more table into the game, we can show how it looks, when the test is passed:

data_3 <- tibble(a = c(2, 1, 2), b = c(4, 5, 6), c = c(7, 8, 9))

check_set_equality(data_1, a, data_3, a)

If the test is passed, the return value of the function will be the first table parameter (invisibly). This ensures that the functions can be used in a pipe conveniently.

Testing cardinalities between two tables

This section contains information and examples for the functions:

  1. check_cardinality_0_n(parent_table, primary_key_column, child_table, foreign_key_column)
  2. check_cardinality_1_n(parent_table, primary_key_column, child_table, foreign_key_column)
  3. check_cardinality_0_1(parent_table, primary_key_column, child_table, foreign_key_column)
  4. check_cardinality_1_1(parent_table, primary_key_column, child_table, foreign_key_column)

The four functions for testing for a specific kind of cardinality of the relation all require a parent table and a child table as input. All these functions first test if this requirement is fulfilled by checking if:

  1. primary_key_column is a unique key for parent_table
  2. The set of values of foreign_key_column is a subset of the set of values of primary_key_column

The cardinality specifications 0_n, 1_n, 0_1, 1_1 refer to the expected relation, that the child table has with the parent table. The numbers ‘0’, ‘1’ and ‘n’ refer to the number of values in the child table’s column (foreign_key_column) that correspond to each value of the parent table’s column (primary_key_column). ‘n’ means more than one in this context, with no upper limit.

0_n means, that for each value of the parent_key_column, the number of corresponding records in the child table is unrestricted. 1_n means, that for each value of the parent_key_column there is at least one coresponding record in the child table. This means that there is a “surjective” relation from the child table to the parent table w.r.t. the specified columns, i.e. for each parent table column value there exists at least one equal child table column value.

0_1 means, that for each value of the parent_key_column, minimally ‘0’ and maximally ‘1’ value has to correspond to it in the child table’s column. This means that there is a “injective” relation from the child table to the parent table w.r.t. the specified columns, i.e. no parent table column value is addressed multiple times. But not all of the parent table column values have to be referred to.

1_1 means, that for each value of the parent_key_column, precisely ‘1’ value has to correspond to it in the child table’s column. This means that there is a “bijective” (“injective” AND “surjective”) relation between the child table and the parent table w.r.t. the specified columns, i.e. the set of values of the two columns is equal and there are no duplicates in either of them.

Table surgery

Relevant functions are:

  1. decompose_table(.data, new_id_column, ...)
  2. reunite_parent_child(child_table, parent_table, id_column)
  3. reunite_parent_child_from_list(list_of_parent_child_tables, id_column)

The first function implements table normalization. An existing table is split into a parent table (i.e. a lookup table) and a child table (containing the observations), linked by a key column (here: new_id_column). Basically, a foreign key relation would be created, pointing from the child table’s new_id_column to the parent table’s corresponding column, which can be seen as the parent table’s primary key column. The function decompose_table() does this, as you can see in the following example:

mtcars_tibble <- as_tibble(mtcars)
mtcars_tibble
#> # A tibble: 32 x 11
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
#>  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
#>  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
#>  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
#>  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
#>  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
#>  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
#>  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
#>  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
#> 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
#> # … with 22 more rows
decomposed_table <- decompose_table(mtcars_tibble, am_gear_carb_id, am, gear, carb)
decomposed_table
#> $child_table
#> # A tibble: 32 x 9
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs am_gear_carb_id
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>           <int>
#>  1  21       6  160    110  3.9   2.62  16.5     0               9
#>  2  21       6  160    110  3.9   2.88  17.0     0               9
#>  3  22.8     4  108     93  3.85  2.32  18.6     1               7
#>  4  21.4     6  258    110  3.08  3.22  19.4     1               1
#>  5  18.7     8  360    175  3.15  3.44  17.0     0               2
#>  6  18.1     6  225    105  2.76  3.46  20.2     1               1
#>  7  14.3     8  360    245  3.21  3.57  15.8     0               4
#>  8  24.4     4  147.    62  3.69  3.19  20       1               5
#>  9  22.8     4  141.    95  3.92  3.15  22.9     1               5
#> 10  19.2     6  168.   123  3.92  3.44  18.3     1               6
#> # … with 22 more rows
#> 
#> $parent_table
#> # A tibble: 13 x 4
#>    am_gear_carb_id    am  gear  carb
#>              <int> <dbl> <dbl> <dbl>
#>  1               1     0     3     1
#>  2               2     0     3     2
#>  3               3     0     3     3
#>  4               4     0     3     4
#>  5               5     0     4     2
#>  6               6     0     4     4
#>  7               7     1     4     1
#>  8               8     1     4     2
#>  9               9     1     4     4
#> 10              10     1     5     2
#> 11              11     1     5     4
#> 12              12     1     5     6
#> 13              13     1     5     8

A new column is created, with which the two tables can be joined by again, essentially creating the original table.

The functions which do the inverse operation, i.e. joining a parent and a child table and subsequently dropping the new_id_column, are reunite_parent_child() and reunite_parent_child_from_list(). The former takes as arguments two tables and the unquoted name of the ID column, and the latter takes as arguments a list of two tables plus the unquoted name of the ID column:

# Shortcut:
reunite_parent_child_from_list(decomposed_table, id_column = am_gear_carb_id)

Currently these functions exist only as a low-level operation on tables. We plan to extend this operation to dm objects.