vignettes/tech-dm-low-level.Rmd
tech-dm-low-level.Rmd
In this document we will present several specialized functions for conducting basic tests about key conditions and about relations between tables. We will also describe functions that can be used for splitting and uniting tables.
This section contains information and examples about the following functions:
When you have tables (data frames) that are connected by key relations, this package can help you to verify the assumed key relations and/or determine the existing key relations between the tables. For example, if you have tables:
data_1 <- tibble(a = c(1, 2, 1), b = c(1, 4, 1), c = c(5, 6, 7))
data_2 <- tibble(a = c(1, 2, 3), b = c(4, 5, 6), c = c(7, 8, 9))
and you want to know if a
is a primary key for data_1
, you can use the check_key()
function:
check_key(data_1, a)
#> Error: (`a`) not a unique key of `data_1`.
Mind the error message when a test is not passed.
For data_2
, column a
is a key:
check_key(data_2, a)
To see if a column of one table contains only those values that are also present in another column of another table, the check_subset()
function can be used:
check_subset(data_1, a, data_2, a)
This function is important for determining if a column is a foreign key to some other table. What about the inverse relation?
check_subset(data_2, a, data_1, a)
#> # A tibble: 1 x 3
#> a b c
#> <dbl> <dbl> <dbl>
#> 1 3 6 9
#> Error: Column `a` of table `data_2` contains values (see examples above) that are not present in column `a` of table `data_1`.
It should be kept in mind that check_subset()
does not test if column c2
is a unique key of table t2
. In order to find out if a (child) table t1
contains a column c1
that is a foreign key to a (parent) table t2
with the corresponding column c2
, the following method should be used:
check_key(t2, c2)
check_subset(t1, c1, t2, c2)
To check both directions at once, and to find out if the unique values of c_1
in t_1
are the same as those of c_2
in t_2
, {dm} provides the function check_set_equality()
:
check_set_equality(data_1, a, data_2, a)
#> # A tibble: 1 x 3
#> a b c
#> <dbl> <dbl> <dbl>
#> 1 3 6 9
#> Error: Column `a` of table `data_2` contains values (see examples above) that are not present in column `a` of table `data_1`..
Introducing one more table enables us to show how it looks when the test is passed:
data_3 <- tibble(a = c(2, 1, 2), b = c(4, 5, 6), c = c(7, 8, 9))
check_set_equality(data_1, a, data_3, a)
If the test is passed, the return value of the function will be the first table parameter (invisibly). This ensures that the functions can be conveniently used in a pipe configuration.
This section contains information and examples for the functions
check_cardinality_0_n(parent_table, primary_key_column, child_table, foreign_key_column)
check_cardinality_1_n(parent_table, primary_key_column, child_table, foreign_key_column)
check_cardinality_0_1(parent_table, primary_key_column, child_table, foreign_key_column)
check_cardinality_1_1(parent_table, primary_key_column, child_table, foreign_key_column)
examine_cardinality(parent_table, primary_key_column, child_table, foreign_key_column)
The four functions for testing for a specific kind of cardinality of the relation all require a parent table and a child table as inputs. The functions first test if that requirement is fulfilled by checking if:
primary_key_column
is a unique key for parent_table
foreign_key_column
is a subset of the set of values of primary_key_column
The cardinality specifications 0_n
, 1_n
, 0_1
, 1_1
refer to the expected relation that the child table has with the parent table. The numbers ‘0’, ‘1’ and ‘n’ refer to the number of values in the child table’s column (foreign_key_column
) that correspond to each value of the parent table’s column (primary_key_column
). ‘n’ means more than one in this context, with no upper limit.
0_n
means, that for each value of the parent_key_column
, the number of corresponding records in the child table is unrestricted. 1_n
means, that for each value of the parent_key_column
there is at least one corresponding record in the child table. This means that there is a “surjective” relation from the child table to the parent table w.r.t. the specified columns, i.e. for each parent table column value there exists at least one equal child table column value.
0_1
means, that for each value of the parent_key_column
, at least zero and at most one value has to correspond to it in the column of the child table. This means that there is an “injective” relation from the child table to the parent table w.r.t. the specified columns, i.e. no parent table column value is addressed multiple times. But not all of the parent table column values have to be referred to.
1_1
means, that for each value of the parent_key_column
, exactly one value has to correspond to it in the child table’s column. This means that there is a “bijective” (“injective” AND “surjective”) relation between the child table and the parent table w.r.t. the specified columns, i.e. the set of values of the two columns is equal and there are no duplicates in either of them.
Also examine_cardinality()
first performs the above mentioned tests to figure out, if the parent-child table relationship criteria are met. Subsequently, two further checks are made to determine the nature of the relation (surjective, injective, bijective or none of these) between the two columns.
Given the following three data frames:
Here are some examples of how the cardinality testing functions can be used:
# This does not pass, `c` is not unique key of d2:
check_cardinality_0_n(d2, c, d1, a)
#> Error: (`c`) not a unique key of `d2`.
# This passes, multiple values in d2$c are allowed:
check_cardinality_0_n(d1, a, d2, c)
# This does not pass, injectivity is violated:
check_cardinality_1_1(d1, a, d2, c)
#> Error: 1..1 cardinality (bijectivity) is not given: Column `c` in table `d2` contains duplicate values.
# This passes:
check_cardinality_0_1(d1, a, d3, c)
examine_cardinality()
returns the type of relation, e.g.:
examine_cardinality(d1, a, d3, c)
#> [1] "injective mapping (child: 0 or 1 -> parent: 1)"
examine_cardinality(d1, a, d2, c)
#> [1] "surjective mapping (child: 1 to n -> parent: 1)"
examine_cardinality(d1, a, d1, a)
#> [1] "bijective mapping (child: 1 -> parent: 1)"
examine_cardinality(d1, a, d4, a)
#> [1] "generic mapping (child: 0 to n -> parent: 1)"
The relevant functions are:
decompose_table(.data, new_id_column, ...)
reunite_parent_child(child_table, parent_table, id_column)
reunite_parent_child_from_list(list_of_parent_child_tables, id_column)
The first function implements table normalization. An existing table is split into a parent table (i.e. a lookup table) and a child table (containing the observations), linked by a key column (here: new_id_column
). Basically, a foreign key relation would be created, pointing from the new_id_column
of the child table to the parent table’s corresponding column, which can be seen as the parent table’s primary key column. The function decompose_table()
does that, as can be seen in the following example:
mtcars_tibble <- as_tibble(mtcars)
mtcars_tibble
#> # A tibble: 32 x 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
#> 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#> 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
#> # … with 22 more rows
decomposed_table <- decompose_table(mtcars_tibble, am_gear_carb_id, am, gear, carb)
decomposed_table
#> $child_table
#> # A tibble: 32 x 9
#> mpg cyl disp hp drat wt qsec vs am_gear_carb_id
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
#> 1 21 6 160 110 3.9 2.62 16.5 0 7
#> 2 21 6 160 110 3.9 2.88 17.0 0 7
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 8
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 2
#> 6 18.1 6 225 105 2.76 3.46 20.2 1 1
#> 7 14.3 8 360 245 3.21 3.57 15.8 0 3
#> 8 24.4 4 147. 62 3.69 3.19 20 1 4
#> 9 22.8 4 141. 95 3.92 3.15 22.9 1 4
#> 10 19.2 6 168. 123 3.92 3.44 18.3 1 5
#> # … with 22 more rows
#>
#> $parent_table
#> # A tibble: 13 x 4
#> am_gear_carb_id am gear carb
#> <int> <dbl> <dbl> <dbl>
#> 1 7 1 4 4
#> 2 8 1 4 1
#> 3 1 0 3 1
#> 4 2 0 3 2
#> 5 3 0 3 4
#> 6 4 0 4 2
#> 7 5 0 4 4
#> 8 6 0 3 3
#> 9 9 1 4 2
#> 10 10 1 5 2
#> 11 11 1 5 4
#> 12 12 1 5 6
#> 13 13 1 5 8
A new column is created, with which the two tables can be joined again, essentially creating the original table.
The functions that do the inverse operation, i.e. join a parent and a child table and subsequently drop the new_id_column
, are reunite_parent_child()
and reunite_parent_child_from_list()
. The former takes as arguments two tables and the unquoted name of the ID column, and the latter takes as arguments a list of two tables plus the unquoted name of the ID column:
parent_table <- decomposed_table$parent_table
child_table <- decomposed_table$child_table
reunite_parent_child(child_table, parent_table, id_column = am_gear_carb_id)
#> # A tibble: 32 x 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
#> 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#> 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
#> # … with 22 more rows
# Shortcut:
reunite_parent_child_from_list(decomposed_table, id_column = am_gear_carb_id)
Currently these functions only exist as a low-level operation on tables. We plan to extend this operation to dm
objects in the future.