5 Tidying
Rows, columns, cells.
This chapter discusses pivoting and data tidying with the help of the tidyr package.
5.1 Pivoting
Click here to show setup code.
## [conflicted] Removing existing preference
## [conflicted] Will prefer [34mdplyr::filter[39m over any other package
## [conflicted] Removing existing preference
## [conflicted] Will prefer [34mdplyr::lag[39m over any other package
Pivoting describes operations that help rearrange data in different ways. The following two tables contain the same data arranged differently.
## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
## # A tibble: 12 x 4
## country year type count
## <chr> <int> <chr> <int>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## # … with 9 more rows
Both tables contain country
and year
column that describe the source of the measurements.
The “wider” version, table1
, contains two columns that hold the number of cases (of a disease) and the population for the corresponding country in the corresponding year.
In the “longer” version, table2
, the number of cases and the population are stored in the same count
column, with the type
column defining the measurement.
Somewhat counter-intuitively, “longer-form” data is often better suited for analyzing data. “Wider-form” data makes better use of screen space, but may be more difficult to work with.
The following example computes the maximum number of cases and population for each country. For the wider form, this requires repeating the same expression for all columns. This may work with two columns but becomes tedious once more measurements are added.
table1 %>%
group_by(country) %>%
summarize(
max_cases = max(cases),
max_population = max(population)
) %>%
ungroup()
## # A tibble: 3 x 3
## country max_cases max_population
## <chr> <int> <int>
## 1 Afghanistan 2666 20595360
## 2 Brazil 80488 174504898
## 3 China 213766 1280428583
The _at
family of functions helps iterating over columns, but all columns still need to be enumerated.
(Specifying ranges of columns is rather brittle.)
## # A tibble: 3 x 3
## country cases population
## <chr> <int> <int>
## 1 Afghanistan 2666 20595360
## 2 Brazil 80488 174504898
## 3 China 213766 1280428583
If the data is in the “longer” form, it is sufficient to include type
in the grouping variables.
The same code works for arbitrary number of measurements.
## # A tibble: 6 x 3
## country type max
## <chr> <chr> <int>
## 1 Afghanistan cases 2666
## 2 Afghanistan population 20595360
## 3 Brazil cases 80488
## 4 Brazil population 174504898
## 5 China cases 213766
## 6 China population 1280428583
The following examples give a gentle introduction into pivoting.
5.1.1 Convert to longer form
The pivot_longer()
function takes a “wider-form” dataset and converts it to an equivalent dataset with more rows.
## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
## # A tibble: 12 x 4
## country year name value
## <chr> <int> <chr> <int>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## # … with 9 more rows
The -c(...)
notation indicates that all column except country
and year
are to be transformed into longer form.
The column names become the contents of the new name
column, the values are available in the value
column.
Your turn: The result of this operation isn’t strictly equivalent to table2
.
What is missing?
Alternatively, the names_to
and values_to
arguments allow specifying the names of the new columns.
## Error in .f(.x[[i]], ...) : object '.....' not found
## # A tibble: 12 x 4
## country year type count
## <chr> <int> <chr> <int>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## # … with 9 more rows
5.1.2 Convert to wider form
The pivot_wider()
form does the inverse: it creates a dataset with fewer rows.
If the name
and value
columns are named differently, these columns can be provided via the names_from
and values_from
arguments.
## # A tibble: 12 x 4
## country year type count
## <chr> <int> <chr> <int>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## # … with 9 more rows
## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
Your turn: pivot_wider()
expects columns name
and value
by default.
What do we need to do to get rid of names_from
and values_from
?
## Error in .f(.x[[i]], ...) : object '......' not found
5.1.3 Use cases
Data in “longer” form usually works better for plotting the values side by side, e.g. by assigning the type of value to an aesthetic.
Recall that each row in the data produces one geometric object in the corresponding layer.
For a bar chart that shows cases and population side by side, mapped to the y
aesthetic, the “longer” form is more natural.
-
table2
form requires only one layer, the fill color is determined automatically, the legend is created automatically -
table1
requires two layers, manual assignment of fill color, and manual creation of legend (not shown)
table2 %>%
ggplot() +
geom_col(aes(country, count, fill = type), position = "dodge") +
facet_wrap(~year) +
scale_y_log10()
table1 %>%
ggplot() +
geom_col(aes(country, population), position = "dodge", fill = "blue") +
geom_col(aes(country, cases), position = "dodge", fill = "red") +
facet_wrap(~year) +
scale_y_log10()
On the other hand, iIf only a single measurement needs to be plotted, the “wider” form is easier to work with.
-
table1
only requires selecting the correct column -
table2
requires afilter()
The “wider” form is also the only way to map different measures to different aesthetics, e.g. to correlate values.
5.1.4 Combining vertically
A different view on the same data is given in the two tables table4a
and table4b
.
## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 19987071 20595360
## 2 Brazil 172006362 174504898
## 3 China 1272915272 1280428583
The bind_rows()
function combines these two parts into a single table.
The .id = "type"
setting ensures that the input datasets gain different tags in the new type
column.
## # A tibble: 6 x 4
## type country `1999` `2000`
## <chr> <chr> <int> <int>
## 1 cases Afghanistan 745 2666
## 2 cases Brazil 37737 80488
## 3 cases China 212258 213766
## 4 population Afghanistan 19987071 20595360
## 5 population Brazil 172006362 174504898
## 6 population China 1272915272 1280428583
Your turn: As before, pivot_longer()
helps converting the results into something similar to table2
.
The result isn’t quite the same yet, can you spot the difference?
## Error in ...(., c(.....)) : could not find function "..."
5.1.5 Tidy data
From “R for data science”:
In a tidy dataset,
- each variable must have its own column.
- each observation must have its own row.
- each value must have its own cell.
The next two sections show examples of datasets that violate these principles.
5.1.6 The Anscombe dataset
This is a dataset where principal statistical summaries (means, standard deviations, intercept and slope after regression) are identical across four subgroups but the layout of the data differs vastly.
## x1 x2 x3 x4 y1 y2 y3 y4
## 1 10 10 10 8 8.04 9.14 7.46 6.58
## 2 8 8 8 8 6.95 8.14 6.77 5.76
## [ reached 'max' / getOption("max.print") -- omitted 9 rows ]
## # A tibble: 11 x 8
## x1 x2 x3 x4 y1 y2 y3 y4
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 10 10 10 8 8.04 9.14 7.46 6.58
## 2 8 8 8 8 6.95 8.14 6.77 5.76
## 3 13 13 13 8 7.58 8.74 12.7 7.71
## # … with 8 more rows
Working with this data in original form is rather tedious.
## # A tibble: 1 x 16
## x1_mean x2_mean x3_mean x4_mean y1_mean y2_mean y3_mean
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 9 9 9 9 7.50 7.50 7.5
## # … with 9 more variables: y4_mean <dbl>, x1_sd <dbl>,
## # x2_sd <dbl>, x3_sd <dbl>, x4_sd <dbl>, y1_sd <dbl>,
## # y2_sd <dbl>, y3_sd <dbl>, y4_sd <dbl>
##
## Call:
## lm(formula = y1 ~ x1, data = anscombe)
##
## Coefficients:
## (Intercept) x1
## 3.0001 0.5001
##
## Call:
## lm(formula = y2 ~ x2, data = anscombe)
##
## Coefficients:
## (Intercept) x2
## 3.001 0.500
##
## Call:
## lm(formula = y3 ~ x3, data = anscombe)
##
## Coefficients:
## (Intercept) x3
## 3.0025 0.4997
##
## Call:
## lm(formula = y4 ~ x4, data = anscombe)
##
## Coefficients:
## (Intercept) x4
## 3.0017 0.4999
## Error in try(ggplot(anscombe) + ...) : '...' used in an incorrect context
Your turn: Which principles of tidy data does the anscombe
dataset violate?
Many operations become much more natural when the data is in longer form.
anscombe_longer <-
anscombe %>%
mutate(id = row_number()) %>%
pivot_longer(
-id,
names_pattern = "(.)(.)",
names_to = c(".value", "measurement")
) %>%
arrange(measurement, id)
anscombe_longer %>%
group_by(measurement) %>%
summarize_at(vars(x, y), list(~ mean(.), ~sd(.))) %>%
ungroup()
## # A tibble: 4 x 5
## measurement x_mean y_mean x_sd y_sd
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1 9 7.50 3.32 2.03
## 2 2 9 7.50 3.32 2.03
## 3 3 9 7.5 3.32 2.03
## 4 4 9 7.50 3.32 2.03
anscombe_longer %>%
group_by(measurement) %>%
summarize(model = list(lm(y ~ x, .))) %>%
ungroup() %>%
pull()
## [[1]]
##
## Call:
## lm(formula = y ~ x, data = .)
##
## Coefficients:
## (Intercept) x
## 3.0013 0.4999
##
##
## [[2]]
##
## Call:
## lm(formula = y ~ x, data = .)
##
## Coefficients:
## (Intercept) x
## 3.0013 0.4999
##
##
## [[3]]
##
## Call:
## lm(formula = y ~ x, data = .)
##
## Coefficients:
## (Intercept) x
## 3.0013 0.4999
##
##
## [[4]]
##
## Call:
## lm(formula = y ~ x, data = .)
##
## Coefficients:
## (Intercept) x
## 3.0013 0.4999
5.1.7 WHO data
The following example shows a case that violates the first two rules of tidy data: WHO data arranged for optimal use of screen space.
The column names define, in addition to the measurement type new_sp
, new_sn
, new_ep
and newrel
, the age and sex stratum of the corresponding measurements.
One single pivot_longer()
call transforms the data into a longer-form version with four measurement columns and one row for each age/sex stratum.
The names_pattern
is a regular expression that defines what part of the column name is stored where.
(Regular expressions are a powerful tool for parsing text data, out of scope for this lecture but very much worth looking into.)
The names_to
sequence defines, for each ()
group in names_pattern
, if the data encoded in the column name is stored in a new column or if it is kept as column name.
## # A tibble: 7,240 x 60
## country iso2 iso3 year new_sp_m014 new_sp_m1524
## <chr> <chr> <chr> <int> <int> <int>
## 1 Afghan… AF AFG 1980 NA NA
## 2 Afghan… AF AFG 1981 NA NA
## 3 Afghan… AF AFG 1982 NA NA
## # … with 7,237 more rows, and 54 more variables:
## # new_sp_m2534 <int>, new_sp_m3544 <int>,
## # new_sp_m4554 <int>, new_sp_m5564 <int>, new_sp_m65 <int>,
## # new_sp_f014 <int>, new_sp_f1524 <int>,
## # new_sp_f2534 <int>, new_sp_f3544 <int>,
## # new_sp_f4554 <int>, new_sp_f5564 <int>, new_sp_f65 <int>,
## # new_sn_m014 <int>, new_sn_m1524 <int>,
## # new_sn_m2534 <int>, new_sn_m3544 <int>,
## # new_sn_m4554 <int>, new_sn_m5564 <int>, new_sn_m65 <int>,
## # new_sn_f014 <int>, new_sn_f1524 <int>,
## # new_sn_f2534 <int>, new_sn_f3544 <int>,
## # new_sn_f4554 <int>, new_sn_f5564 <int>, new_sn_f65 <int>,
## # new_ep_m014 <int>, new_ep_m1524 <int>,
## # new_ep_m2534 <int>, new_ep_m3544 <int>,
## # new_ep_m4554 <int>, new_ep_m5564 <int>, new_ep_m65 <int>,
## # new_ep_f014 <int>, new_ep_f1524 <int>,
## # new_ep_f2534 <int>, new_ep_f3544 <int>,
## # new_ep_f4554 <int>, new_ep_f5564 <int>, new_ep_f65 <int>,
## # newrel_m014 <int>, newrel_m1524 <int>,
## # newrel_m2534 <int>, newrel_m3544 <int>,
## # newrel_m4554 <int>, newrel_m5564 <int>, newrel_m65 <int>,
## # newrel_f014 <int>, newrel_f1524 <int>,
## # newrel_f2534 <int>, newrel_f3544 <int>,
## # newrel_f4554 <int>, newrel_f5564 <int>, newrel_f65 <int>
who_longer <-
who %>%
pivot_longer(
-(country:year),
names_pattern = "([a-z_]+)_(.)([0-9]+)",
names_to = c(".value", "sex", "age")
)
who_longer
## # A tibble: 101,360 x 10
## country iso2 iso3 year sex age new_sp new_sn new_ep
## <chr> <chr> <chr> <int> <chr> <chr> <int> <int> <int>
## 1 Afghan… AF AFG 1980 m 014 NA NA NA
## 2 Afghan… AF AFG 1980 m 1524 NA NA NA
## 3 Afghan… AF AFG 1980 m 2534 NA NA NA
## # … with 1.014e+05 more rows, and 1 more variable:
## # newrel <int>
## # A tibble: 14 x 3
## sex age n
## <chr> <chr> <int>
## 1 f 014 7240
## 2 f 1524 7240
## 3 f 2534 7240
## # … with 11 more rows
5.1.8 Formatting tables for presentation
Wide tables make better use of screen or paper space, and should be used when presenting tables that summarize across multiple dimensions.
For this, ftable()
is the better tool.
## type cases population
## country year
## Afghanistan 1999 745 19987071
## 2000 2666 20595360
## Brazil 1999 37737 172006362
## 2000 80488 174504898
## China 1999 212258 1272915272
## 2000 213766 1280428583
These can be converted to HTML for inclusion in reports, e.g. with memisc::format_html()
.
table2_html <-
table2 %>%
xtabs(count ~ ., .) %>%
ftable(col.vars = c("year", "type")) %>%
memisc::format_html()
cat(table2_html)
year: | 1999 | 2000 | |||||||||||
country | type: | cases | population | cases | population | ||||||||
Afghanistan | 745 | 19987071 | 2666 | 20595360 | |||||||||
Brazil | 37737 | 172006362 | 80488 | 174504898 | |||||||||
China | 212258 | 1272915272 | 213766 | 1280428583 |
The {gt} package is a modern approach to present tables for formatting.
5.2 Separating and uniting
Click here to show setup code.
## [conflicted] Removing existing preference
## [conflicted] Will prefer [34mdplyr::filter[39m over any other package
## [conflicted] Removing existing preference
## [conflicted] Will prefer [34mdplyr::lag[39m over any other package
The table3
table violates the third principle of tidy data: each cell in the rate
column contains two values.
## # A tibble: 6 x 3
## country year rate
## * <chr> <int> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
The separate()
verb offers a convenient way to deal with this situation, including automatic type conversion.
## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <chr> <chr>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
The inverse is offered by unite()
.
The data in table5
stores year data in two columns.
## # A tibble: 6 x 4
## country century year rate
## * <chr> <chr> <chr> <chr>
## 1 Afghanistan 19 99 745/19987071
## 2 Afghanistan 20 00 2666/20595360
## 3 Brazil 19 99 37737/172006362
## 4 Brazil 20 00 80488/174504898
## 5 China 19 99 212258/1272915272
## 6 China 20 00 213766/1280428583
## # A tibble: 6 x 3
## country year rate
## <chr> <chr> <chr>
## 1 Afghanistan 19_99 745/19987071
## 2 Afghanistan 20_00 2666/20595360
## 3 Brazil 19_99 37737/172006362
## 4 Brazil 20_00 80488/174504898
## 5 China 19_99 212258/1272915272
## 6 China 20_00 213766/1280428583
The result needs a few tweaks to finally resemble table3
.
## # A tibble: 6 x 3
## country year rate
## <chr> <chr> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
## # A tibble: 6 x 3
## country year rate
## <chr> <dbl> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
See the help for further details.
5.2.1 Parsing numbers
## # A tibble: 2 x 1
## num
## <chr>
## 1 1'000.00
## 2 2'000'000.00
## Warning: Expected 1 pieces. Additional pieces discarded in 2
## rows [1, 2].
## # A tibble: 2 x 1
## num
## <chr>
## 1 1
## 2 2
thousand_separator %>%
mutate(num = str_replace_all(num, "[^-0-9.]", "")) %>%
mutate(num = as.numeric(num))
## # A tibble: 2 x 1
## num
## <dbl>
## 1 1000
## 2 2000000