4 Pairwise iteration and nesting
This chapter explores iterating over pairs (or generally lists) of vectors of the same length. The relationship between vectors and data frame columns is especially helpful here, because values in one row of a tibble naturally correspond to accessing the same index in multiple vectors.
This chapter uses the good_times object, built below with the approach from the “Manipulating all datasets” section.
library(tidyverse)
library(here)
dict <- readxl::read_excel(here("data/cities.xlsx"))
input_data <-
  dict %>%
  select(city_code, weather_filename) %>%
  deframe() %>%
  map(~ readxl::read_excel(here(.)))
find_good_times <- function(data) {
  data %>%
    # contains("emperature") matches temperature and apparentTemperature
    select(time, contains("emperature")) %>%
    filter(temperature >= 14)
}
good_times <-
  input_data %>%
  map(find_good_times)
good_times
## $berlin
## # A tibble: 16 x 3
## time temperature apparentTemperature
## <dttm> <dbl> <dbl>
## 1 2019-04-28 17:00:00 14.1 14.1
## 2 2019-04-29 12:00:00 15.6 15.6
## 3 2019-04-29 13:00:00 17.4 17.4
## # … with 13 more rows
##
## $toronto
## # A tibble: 0 x 3
## # … with 3 variables: time <dttm>, temperature <dbl>,
## # apparentTemperature <dbl>
##
## $tel_aviv
## # A tibble: 49 x 3
## time temperature apparentTemperature
## <dttm> <dbl> <dbl>
## 1 2019-04-28 15:00:00 23.9 23.9
## 2 2019-04-28 16:00:00 23.1 23.1
## 3 2019-04-28 17:00:00 22.4 22.4
## # … with 46 more rows
##
## $zurich
## # A tibble: 1 x 3
## time temperature apparentTemperature
## <dttm> <dbl> <dbl>
## 1 2019-04-30 15:00:00 14.3 14.3
4.1 Manipulating pairwise
Here we discuss cases where you want to iterate through two lists (of the same length) in parallel and use each pair of values as two of the input parameters of a function.
We first prepare a vector of output file names:
output_filenames <- tempfile(names(good_times), fileext = ".csv")
output_filenames
## [1] "/tmp/RtmpCquLue/berlin2db86db4890.csv"
## [2] "/tmp/RtmpCquLue/toronto2db85d48b7b5.csv"
## [3] "/tmp/RtmpCquLue/tel_aviv2db8616cda7.csv"
## [4] "/tmp/RtmpCquLue/zurich2db83741747a.csv"
We want to use readr::write_csv() to write each tibble into its respective file.
write_csv() needs at least two arguments: the tibble itself and the path to the file.
For illustration, we implement a file-centric wrapper function that takes the file name as its first argument and also prints a message every time a file is written.
We use map2() to handle this:
process_csv <- function(file, data) {
  readr::write_csv(data, file)
  message("Writing ", file)
  invisible(file)
}
map2(good_times, output_filenames, ~ process_csv(..2, ..1))
## Writing /tmp/RtmpCquLue/berlin2db86db4890.csv
## Writing /tmp/RtmpCquLue/toronto2db85d48b7b5.csv
## Writing /tmp/RtmpCquLue/tel_aviv2db8616cda7.csv
## Writing /tmp/RtmpCquLue/zurich2db83741747a.csv
## $berlin
## [1] "/tmp/RtmpCquLue/berlin2db86db4890.csv"
##
## $toronto
## [1] "/tmp/RtmpCquLue/toronto2db85d48b7b5.csv"
##
## $tel_aviv
## [1] "/tmp/RtmpCquLue/tel_aviv2db8616cda7.csv"
##
## $zurich
## [1] "/tmp/RtmpCquLue/zurich2db83741747a.csv"
invisible(map2(good_times, output_filenames, ~ process_csv(..2, ..1)))
## Writing /tmp/RtmpCquLue/berlin2db86db4890.csv
## Writing /tmp/RtmpCquLue/toronto2db85d48b7b5.csv
## Writing /tmp/RtmpCquLue/tel_aviv2db8616cda7.csv
## Writing /tmp/RtmpCquLue/zurich2db83741747a.csv
Because process_csv() returns the file name, map2() collects the file names in a list, which is printed unless the call is wrapped in invisible().
Since we are only interested in the side effects of write_csv() and not in the displayed output, we can use the related function walk2().
walk2(good_times, output_filenames, ~ process_csv(..2, ..1))
## Writing /tmp/RtmpCquLue/berlin2db86db4890.csv
## Writing /tmp/RtmpCquLue/toronto2db85d48b7b5.csv
## Writing /tmp/RtmpCquLue/tel_aviv2db8616cda7.csv
## Writing /tmp/RtmpCquLue/zurich2db83741747a.csv
print(walk2(good_times, output_filenames, ~ process_csv(..2, ..1)))
## Writing /tmp/RtmpCquLue/berlin2db86db4890.csv
## Writing /tmp/RtmpCquLue/toronto2db85d48b7b5.csv
## Writing /tmp/RtmpCquLue/tel_aviv2db8616cda7.csv
## Writing /tmp/RtmpCquLue/zurich2db83741747a.csv
## $berlin
## # A tibble: 16 x 3
## time temperature apparentTemperature
## <dttm> <dbl> <dbl>
## 1 2019-04-28 17:00:00 14.1 14.1
## 2 2019-04-29 12:00:00 15.6 15.6
## 3 2019-04-29 13:00:00 17.4 17.4
## # … with 13 more rows
##
## $toronto
## # A tibble: 0 x 3
## # … with 3 variables: time <dttm>, temperature <dbl>,
## # apparentTemperature <dbl>
##
## $tel_aviv
## # A tibble: 49 x 3
## time temperature apparentTemperature
## <dttm> <dbl> <dbl>
## 1 2019-04-28 15:00:00 23.9 23.9
## 2 2019-04-28 16:00:00 23.1 23.1
## 3 2019-04-28 17:00:00 22.4 22.4
## # … with 46 more rows
##
## $zurich
## # A tibble: 1 x 3
## time temperature apparentTemperature
## <dttm> <dbl> <dbl>
## 1 2019-04-30 15:00:00 14.3 14.3
walk2() returns its first argument (invisibly), so it can be used in the middle of a pipe.
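The difference between the two functions can be checked on toy data. The following is a minimal sketch, assuming the purrr, tibble, and readr packages are installed; toy_tables, toy_paths, written, and returned are hypothetical names that do not appear elsewhere in this chapter.

```r
library(purrr)
library(tibble)

# Hypothetical toy data standing in for good_times
toy_tables <- list(
  a = tibble(x = 1:3),
  b = tibble(x = 4:5)
)
toy_paths <- tempfile(names(toy_tables), fileext = ".csv")

# map2() collects the return value of each call in a list
written <- map2(toy_tables, toy_paths, function(data, file) {
  readr::write_csv(data, file)
  file
})

# walk2() calls the function for its side effects only
# and hands back its first argument
returned <- walk2(toy_tables, toy_paths, function(data, file) {
  readr::write_csv(data, file)
})

identical(returned, toy_tables)
all(file.exists(toy_paths))
```

A named function argument order such as function(data, file) can be easier to read than ..1 and ..2 when the wrapped function takes its arguments in a different order.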
4.1.1 Exercises
What does the following code display?
good_times %>%
  walk2(output_filenames, ~ readr::write_csv(..1, ..2)) %>%
  map_int(nrow)
4.2 Moving to tibble-land
Setup code for this section:
library(tidyverse)
library(here)
dict <- readxl::read_excel(here("data/cities.xlsx"))
input_data <-
  dict %>%
  select(city_code, weather_filename) %>%
  deframe() %>%
  map(~ readxl::read_excel(here(.)))
find_good_times <- function(data) {
  data %>%
    select(time, contains("emperature")) %>%
    filter(temperature >= 14)
}
good_times <-
  input_data %>%
  map(find_good_times)
How can we combine the abilities of map() and related functions, which work on vectors and lists, with our commonly used data structure, the tibble?
We start with the named list of tibbles called input_data from section “Processing all files” and with dict from section “Named vectors and two-column tibbles”.
Calling enframe() to produce a data frame from input_data leads to a result that may be surprising at first but is often useful:
nested_input_data <-
  input_data %>%
  enframe()
nested_input_data
## # A tibble: 4 x 2
## name value
## <chr> <list>
## 1 berlin <tibble [49 × 18]>
## 2 toronto <tibble [49 × 18]>
## 3 tel_aviv <tibble [49 × 17]>
## 4 zurich <tibble [49 × 18]>
This is because lists are also vectors, so a list can serve as a column of a tibble.
In our case each list entry contains a tibble, which is “nested” into the corresponding entry of the column value.
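The nesting can be inspected on a small example. This is a minimal sketch, assuming the tibble package is installed; toy_list and nested are hypothetical names used only for illustration.

```r
library(tibble)

# Hypothetical toy list standing in for input_data
toy_list <- list(
  first = tibble(x = 1:2),
  second = tibble(x = 3:5)
)

nested <- enframe(toy_list)

# `value` is a list column: each entry holds one complete tibble
class(nested$value)
nested$value[[2]]

# deframe() reverses the operation and turns the two columns
# back into a named list
names(deframe(nested))
```

Indexing the list column with [[ ]] retrieves a single nested tibble, exactly as it would for an ordinary list.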
Starting with the tibble dict, we can see how dplyr::mutate() and map() work together nicely to produce a similar result:
dict %>%
  select(city_code, weather_filename) %>%
  mutate(
    data = map(weather_filename, ~ readxl::read_excel(here(.)))
  )
## # A tibble: 4 x 3
## city_code weather_filename data
## <chr> <chr> <list>
## 1 berlin data/weather/berlin.xlsx <tibble [49 × 18]>
## 2 toronto data/weather/toronto.xlsx <tibble [49 × 18]>
## 3 tel_aviv data/weather/tel_aviv.xlsx <tibble [49 × 17]>
## 4 zurich data/weather/zurich.xlsx <tibble [49 × 18]>
This works because R interprets columns of tibbles as vectors, which can be fed to map().
To simplify the map() call, we create an intermediate column:
dict %>%
  select(city_code, weather_filename) %>%
  mutate(path = here(weather_filename)) %>%
  mutate(data = map(path, readxl::read_excel))
## # A tibble: 4 x 4
## city_code weather_filename path data
## <chr> <chr> <chr> <list>
## 1 berlin data/weather/berlin.… /home/travis/build/krlmlr/tid… <tibble […
## 2 toronto data/weather/toronto… /home/travis/build/krlmlr/tid… <tibble […
## 3 tel_aviv data/weather/tel_avi… /home/travis/build/krlmlr/tid… <tibble […
## 4 zurich data/weather/zurich.… /home/travis/build/krlmlr/tid… <tibble […
Staying in “tibble-land” as long as possible helps retain other important components of the data you are processing, so that you can keep using familiar data transformation tools.
dict_data <-
  dict %>%
  mutate(
    data = map(weather_filename, ~ readxl::read_excel(here(.))),
    rows = map_int(data, nrow),
  ) %>%
  select(-weather_filename)
dict_data
## # A tibble: 4 x 6
## city_code name lng lat data rows
## <chr> <chr> <dbl> <dbl> <list> <int>
## 1 berlin Berlin 13.4 52.5 <tibble [49 × 18]> 49
## 2 toronto Toronto -79.4 43.7 <tibble [49 × 18]> 49
## 3 tel_aviv Tel Aviv 34.8 32.1 <tibble [49 × 17]> 49
## 4 zurich Zürich 8.54 47.4 <tibble [49 × 18]> 49
This pattern can also be used with the map2() family of functions:
dict_data_with_desc <-
  dict_data %>%
  mutate(
    desc = map2_chr(
      name, rows,
      ~ paste0(..2, " rows in data for ", ..1)
    )
  )
By default, mutate() appends new columns at the end, so the most recently added column can be accessed by calling pull() without arguments:
dict_data_with_desc %>%
  pull()
## [1] "49 rows in data for Berlin" "49 rows in data for Toronto"
## [3] "49 rows in data for Tel Aviv" "49 rows in data for Zürich"
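The default behavior of pull() can be seen in isolation on a small tibble. This is a minimal sketch, assuming dplyr and tibble are installed; toy and doubled are hypothetical names.

```r
library(dplyr)
library(tibble)

toy <- tibble(id = 1:3) %>%
  mutate(doubled = id * 2)

# pull() without arguments extracts the last column
toy %>% pull()

# Naming the column gives the same result here and keeps working
# if columns are reordered later
toy %>% pull(doubled)
```

Relying on the default is convenient for interactive checks; in scripts, naming the column is usually the safer choice.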
More generally, pmap() supports functions with an arbitrary number of arguments:
dict_data %>%
  mutate(
    cols = map_int(data, ncol),
    desc = pmap_chr(
      list(name, rows, cols),
      ~ paste0(..2, " rows and ", ..3, " cols in data for ", ..1)
    )
  )
## # A tibble: 4 x 8
## city_code name lng lat data rows cols desc
## <chr> <chr> <dbl> <dbl> <list> <int> <int> <chr>
## 1 berlin Berlin 13.4 52.5 <tibble [… 49 18 49 rows and 18 col…
## 2 toronto Toronto -79.4 43.7 <tibble [… 49 18 49 rows and 18 col…
## 3 tel_aviv Tel Av… 34.8 32.1 <tibble [… 49 17 49 rows and 17 col…
## 4 zurich Zürich 8.54 47.4 <tibble [… 49 18 49 rows and 18 col…
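When the input list is named, pmap() matches the list names to the argument names of the function, which avoids the positional ..1, ..2, ..3 placeholders. The following is a minimal sketch, assuming purrr is installed; toy_args and describe() are hypothetical, and the values merely mimic the name, rows, and cols columns.

```r
library(purrr)

# Hypothetical values standing in for the name, rows, and cols columns
toy_args <- list(
  name = c("Berlin", "Toronto"),
  rows = c(49L, 49L),
  cols = c(18L, 18L)
)

describe <- function(name, rows, cols) {
  paste0(rows, " rows and ", cols, " cols in data for ", name)
}

# The list names are matched to the argument names of describe()
desc <- pmap_chr(toy_args, describe)
desc
```

Matching by name makes the call independent of the order of the list elements, which can help once more than two or three inputs are involved.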
4.2.1 Exercises
The imap() family of functions iterates over a vector and its names:
input_data %>%
  imap_chr(~ paste0(.y, ": ", nrow(.x), " rows"))
## berlin toronto tel_aviv
## "berlin: 49 rows" "toronto: 49 rows" "tel_aviv: 49 rows"
## zurich
## "zurich: 49 rows"
Implement the same functionality using map2() inside a mutate(), and enframe():
good_times %>%
  ___() %>%
  mutate(___ = map2()) %>%
  deframe()
4.3 Nesting and unnesting
Setup code for this section:
library(tidyverse)
library(here)
dict <- readxl::read_excel(here("data/cities.xlsx"))
dict_data <-
  dict %>%
  mutate(data = map(weather_filename, ~ readxl::read_excel(here(.)))) %>%
  select(-weather_filename)
How do we work with nested data?
We start with the tibble dict_data from section “Moving to tibble-land”, which includes the nested tibbles in its column data.
To look at the actual data, we can call tidyr::unnest() directly on the whole tibble; by default it acts on all list-columns.
This expands our tibble: the entries of the other columns are repeated once for each row of the corresponding nested tibble:
dict_data %>%
  unnest()
## # A tibble: 196 x 22
## city_code name lng lat time summary icon
## <chr> <chr> <dbl> <dbl> <dttm> <chr> <chr>
## 1 berlin Berl… 13.4 52.5 2019-04-28 15:00:00 Mostly… part…
## 2 berlin Berl… 13.4 52.5 2019-04-28 16:00:00 Mostly… part…
## 3 berlin Berl… 13.4 52.5 2019-04-28 17:00:00 Mostly… part…
## # … with 193 more rows, and 15 more variables: precipIntensity <dbl>,
## # precipProbability <dbl>, temperature <dbl>, apparentTemperature <dbl>,
## # dewPoint <dbl>, humidity <dbl>, pressure <dbl>, windSpeed <dbl>,
## # windGust <dbl>, windBearing <dbl>, cloudCover <dbl>, uvIndex <dbl>,
## # visibility <dbl>, ozone <dbl>, precipType <chr>
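The repetition of the outer columns is easiest to see on a small example. This is a minimal sketch, assuming dplyr, tibble, and tidyr 1.0.0 or later are installed; in those tidyr versions the list-columns to expand are named explicitly through the cols argument, and calling unnest() without it produces a warning. The names toy_nested and flat are hypothetical.

```r
library(dplyr)
library(tibble)
library(tidyr)

# Hypothetical nested tibble: one row per city, one tibble per row
toy_nested <- tibble(
  city = c("a", "b"),
  data = list(
    tibble(temp = c(10, 12)),
    tibble(temp = c(20, 21, 22))
  )
)

# Name the list-column to expand explicitly
flat <- toy_nested %>%
  unnest(cols = data)
flat
```

The value of city is repeated two times for the first nested tibble and three times for the second, giving five rows in total.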
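This is very similar to calling bind_rows() on the data column, except that the columns outside the list-column (city_code, name, lng, lat) are not carried along.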
dict_data %>%
  pull(data) %>%
  bind_rows()
## # A tibble: 196 x 18
## time summary icon precipIntensity precipProbabili…
## <dttm> <chr> <chr> <dbl> <dbl>
## 1 2019-04-28 15:00:00 Mostly… part… 0 0
## 2 2019-04-28 16:00:00 Mostly… part… 0 0
## 3 2019-04-28 17:00:00 Mostly… part… 0 0
## # … with 193 more rows, and 13 more variables: temperature <dbl>,
## # apparentTemperature <dbl>, dewPoint <dbl>, humidity <dbl>,
## # pressure <dbl>, windSpeed <dbl>, windGust <dbl>, windBearing <dbl>,
## # cloudCover <dbl>, uvIndex <dbl>, visibility <dbl>, ozone <dbl>,
## # precipType <chr>
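# Stricter variant of bind_rows() that fails if the column names differ.
# The helper returns y so that reduce() compares each table with the next one;
# without a return value, the second comparison would receive NULL.
check_columns_same <- function(x, y) {
  stopifnot(identical(colnames(x), colnames(y)))
  y
}
# Note: this definition masks dplyr::bind_rows() for the rest of the session
bind_rows <- function(data_frames) {
  # Called for the side effect
  reduce(data_frames, check_columns_same)
  dplyr::bind_rows(data_frames)
}
try(
  dict_data %>%
    pull(data) %>%
    bind_rows()
)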
## Error in fn(out, elt, ...) :
## identical(colnames(x), colnames(y)) is not TRUE
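The strict wrapper above fails because the input tables do not all have the same columns (the Tel Aviv table has 17 columns, the others 18). dplyr::bind_rows() itself accepts such input and fills the missing columns with NA. The following is a minimal sketch of that behavior, assuming dplyr and tibble are installed; with_extra, without_extra, and the column values are hypothetical and do not reflect which column is actually missing in the weather data.

```r
library(dplyr)
library(tibble)

# Hypothetical tables: the second one lacks the `ozone` column
with_extra <- tibble(time = 1:2, ozone = c(300, 310))
without_extra <- tibble(time = 3L)

# Call the dplyr version explicitly in case bind_rows() has been masked
combined <- dplyr::bind_rows(with_extra, without_extra)
combined

# Rows from the table without `ozone` get NA in that column
is.na(combined$ozone)
```

Whether silent filling or a hard failure is preferable depends on the analysis; the strict variant makes differences between input files visible early.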
Data flattened in this way is useful if the parts can be combined naturally into a larger dataset. Iterating over columns in the nested view corresponds to grouped operations in the flat view.
dict_data %>%
  mutate(n = map_int(data, nrow)) %>%
  select(-data)
## # A tibble: 4 x 5
## city_code name lng lat n
## <chr> <chr> <dbl> <dbl> <int>
## 1 berlin Berlin 13.4 52.5 49
## 2 toronto Toronto -79.4 43.7 49
## 3 tel_aviv Tel Aviv 34.8 32.1 49
## 4 zurich Zürich 8.54 47.4 49
dict_data %>%
  unnest() %>%
  count(name)
## # A tibble: 4 x 2
## name n
## <chr> <int>
## 1 Berlin 49
## 2 Tel Aviv 49
## 3 Toronto 49
## 4 Zürich 49
Conversely, if you want a more condensed view of your data, you can nest it again.
By default, the function tidyr::nest() nests all columns.
It is therefore often useful to tell it which columns to leave out of the nesting:
dict_data %>%
  unnest() %>%
  nest(-city_code, -name, -lng, -lat)
## # A tibble: 4 x 5
## city_code name lng lat data
## <chr> <chr> <dbl> <dbl> <list>
## 1 berlin Berlin 13.4 52.5 <tibble [49 × 18]>
## 2 toronto Toronto -79.4 43.7 <tibble [49 × 18]>
## 3 tel_aviv Tel Aviv 34.8 32.1 <tibble [49 × 18]>
## 4 zurich Zürich 8.54 47.4 <tibble [49 × 18]>
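In tidyr 1.0.0 and later, the columns to nest are given as a named argument, where the name becomes the name of the new list-column. The following is a minimal sketch under that assumption, with dplyr and tibble also installed; toy_flat and toy_renested are hypothetical names.

```r
library(dplyr)
library(tibble)
library(tidyr)

# Hypothetical flat table with one grouping column
toy_flat <- tibble(
  city = c("a", "a", "b"),
  time = 1:3,
  temp = c(10, 12, 20)
)

# Nest every column except `city` into a list-column named `data`
toy_renested <- toy_flat %>%
  nest(data = -city)
toy_renested

# Number of rows in each nested tibble
map_rows <- vapply(toy_renested$data, nrow, integer(1))
map_rows
```

The result has one row per distinct value of city, and the excluded column plays the same role as a grouping variable in a grouped operation.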
Using this, we can structure our data in new, customized ways.
To process daily data across all cities, we create a new column date and nest by it:
dict_data %>%
  unnest() %>%
  mutate(date = as.Date(time)) %>%
  nest(-date)
## # A tibble: 3 x 2
## date data
## <date> <list>
## 1 2019-04-28 <tibble [36 × 22]>
## 2 2019-04-29 <tibble [96 × 22]>
## 3 2019-04-30 <tibble [64 × 22]>
4.3.1 Exercises
Implement the following code as a mapping over a nested tibble. Use a helper function:
iris %>%
  group_by(Species) %>%
  summarize_all(list(Mean = mean)) %>%
  ungroup()
## # A tibble: 3 x 5
## Species Sepal.Length_Me… Sepal.Width_Mean Petal.Length_Me…
## <fct> <dbl> <dbl> <dbl>
## 1 setosa 5.01 3.43 1.46
## 2 versic… 5.94 2.77 4.26
## 3 virgin… 6.59 2.97 5.55
## # … with 1 more variable: Petal.Width_Mean <dbl>
summarize_to_mean <- function(data) {
  data %>%
    ___(_____)
}
iris %>%
  nest(___) %>%
  mutate(data = map(___, summarize_to_mean)) %>%
  unnest()
When is a grouped operation preferable to nesting? Discuss.
Data frames are lists under the hood. Explain the output of the following code. What use cases can you imagine?
dict_data %>%
  as.list() %>%
  enframe()
## # A tibble: 5 x 2
## name value
## <chr> <list>
## 1 city_code <chr [4]>
## 2 name <chr [4]>
## 3 lng <dbl [4]>
## 4 lat <dbl [4]>
## 5 data <list [4]>