4 Pairwise iteration and nesting

This chapter explores iterating over pairs (or generally lists) of vectors of the same length. The relationship between vectors and data frame columns is especially helpful here, because values in one row of a tibble naturally correspond to accessing the same index in multiple vectors.

This chapter uses the manipulated_data object from the “Manipulating all datasets” section.

library(tidyverse)
library(here)

dict <- readxl::read_excel(here("data/cities.xlsx"))

input_data <-
  dict %>%
  select(city_code, weather_filename) %>%
  deframe() %>%
  map(~ readxl::read_excel(here(.)))

find_good_times <- function(data) {
  data %>%
    select(time, contains("emperature")) %>%
    filter(temperature >= 14)
}

good_times <-
  input_data %>%
  map(find_good_times)

good_times

## $berlin
## # A tibble: 16 x 3
##   time                temperature apparentTemperature
##   <dttm>                    <dbl>               <dbl>
## 1 2019-04-28 17:00:00        14.1                14.1
## 2 2019-04-29 12:00:00        15.6                15.6
## 3 2019-04-29 13:00:00        17.4                17.4
## # … with 13 more rows
## 
## $toronto
## # A tibble: 0 x 3
## # … with 3 variables: time <dttm>, temperature <dbl>,
## #   apparentTemperature <dbl>
## 
## $tel_aviv
## # A tibble: 49 x 3
##   time                temperature apparentTemperature
##   <dttm>                    <dbl>               <dbl>
## 1 2019-04-28 15:00:00        23.9                23.9
## 2 2019-04-28 16:00:00        23.1                23.1
## 3 2019-04-28 17:00:00        22.4                22.4
## # … with 46 more rows
## 
## $zurich
## # A tibble: 1 x 3
##   time                temperature apparentTemperature
##   <dttm>                    <dbl>               <dbl>
## 1 2019-04-30 15:00:00        14.3                14.3

4.1 Manipulating pairwise

Here we discuss cases when you want to iterate through two lists (of the same length) in parallel and use each value pair as two of the input parameters of a function.

We first prepare a list of future output filenames:

output_filenames <- tempfile(names(good_times), fileext = ".csv")
output_filenames

## [1] "/tmp/RtmpCquLue/berlin2db86db4890.csv"  
## [2] "/tmp/RtmpCquLue/toronto2db85d48b7b5.csv"
## [3] "/tmp/RtmpCquLue/tel_aviv2db8616cda7.csv"
## [4] "/tmp/RtmpCquLue/zurich2db83741747a.csv"

We want to use readr::write_csv() to write each tibble into the respective file. write_csv() needs at least 2 arguments: the tibble itself and the path to the filename. For illustration, we implement a file-centric wrapper function that takes the file name as first argument and also prints a message every time a file is written. We use map2() to handle this:

process_csv <- function(file, data) {
  readr::write_csv(data, file)
  message("Writing ", file)
  invisible(file)
}

map2(good_times, output_filenames, ~ process_csv(..2, ..1))

## Writing /tmp/RtmpCquLue/berlin2db86db4890.csv

## Writing /tmp/RtmpCquLue/toronto2db85d48b7b5.csv

## Writing /tmp/RtmpCquLue/tel_aviv2db8616cda7.csv

## Writing /tmp/RtmpCquLue/zurich2db83741747a.csv

## $berlin
## [1] "/tmp/RtmpCquLue/berlin2db86db4890.csv"
## 
## $toronto
## [1] "/tmp/RtmpCquLue/toronto2db85d48b7b5.csv"
## 
## $tel_aviv
## [1] "/tmp/RtmpCquLue/tel_aviv2db8616cda7.csv"
## 
## $zurich
## [1] "/tmp/RtmpCquLue/zurich2db83741747a.csv"

invisible(map2(good_times, output_filenames, ~ process_csv(..2, ..1)))

## Writing /tmp/RtmpCquLue/berlin2db86db4890.csv

## Writing /tmp/RtmpCquLue/toronto2db85d48b7b5.csv

## Writing /tmp/RtmpCquLue/tel_aviv2db8616cda7.csv

## Writing /tmp/RtmpCquLue/zurich2db83741747a.csv

Because process_csv() returns the file name, it is available as output. Since we are just interested in the side-effects of write_csv() and not in the displayed output, we can use the related function walk2().

walk2(good_times, output_filenames, ~ process_csv(..2, ..1))

## Writing /tmp/RtmpCquLue/berlin2db86db4890.csv

## Writing /tmp/RtmpCquLue/toronto2db85d48b7b5.csv

## Writing /tmp/RtmpCquLue/tel_aviv2db8616cda7.csv

## Writing /tmp/RtmpCquLue/zurich2db83741747a.csv

print(walk2(good_times, output_filenames, ~ process_csv(..2, ..1)))

## Writing /tmp/RtmpCquLue/berlin2db86db4890.csv

## Writing /tmp/RtmpCquLue/toronto2db85d48b7b5.csv

## Writing /tmp/RtmpCquLue/tel_aviv2db8616cda7.csv

## Writing /tmp/RtmpCquLue/zurich2db83741747a.csv

## $berlin
## # A tibble: 16 x 3
##   time                temperature apparentTemperature
##   <dttm>                    <dbl>               <dbl>
## 1 2019-04-28 17:00:00        14.1                14.1
## 2 2019-04-29 12:00:00        15.6                15.6
## 3 2019-04-29 13:00:00        17.4                17.4
## # … with 13 more rows
## 
## $toronto
## # A tibble: 0 x 3
## # … with 3 variables: time <dttm>, temperature <dbl>,
## #   apparentTemperature <dbl>
## 
## $tel_aviv
## # A tibble: 49 x 3
##   time                temperature apparentTemperature
##   <dttm>                    <dbl>               <dbl>
## 1 2019-04-28 15:00:00        23.9                23.9
## 2 2019-04-28 16:00:00        23.1                23.1
## 3 2019-04-28 17:00:00        22.4                22.4
## # … with 46 more rows
## 
## $zurich
## # A tibble: 1 x 3
##   time                temperature apparentTemperature
##   <dttm>                    <dbl>               <dbl>
## 1 2019-04-30 15:00:00        14.3                14.3

walk2() returns its first argument so that it can be used in a pipe.

4.1.1 Exercises

What does the following code display?

good_times %>% 
  walk2(output_filenames, ~ readr::write_csv(..1, ..2)) %>% 
  map_int(nrow)

4.2 Moving to tibble-land

Click here to show setup code.

library(tidyverse)
library(here)

dict <- readxl::read_excel(here("data/cities.xlsx"))

input_data <-
  dict %>%
  select(city_code, weather_filename) %>%
  deframe() %>%
  map(~ readxl::read_excel(here(.)))

find_good_times <- function(data) {
  data %>%
    select(time, contains("emperature")) %>%
    filter(temperature >= 14)
}

good_times <-
  input_data %>%
  map(find_good_times)

How to combine the abilities of map() & co., which work on vectors and lists, with our commonly used data structure, the tibble?

We start with the named list of tibbles called input_data from section “Processing all files” and with dict from section “Named vectors and two-column tibbles”.

Calling enframe() to produce a data frame from input_data leads to a maybe at first surprising, but oftentimes useful result:

nested_input_data <-
  input_data %>%
  enframe()

nested_input_data

## # A tibble: 4 x 2
##   name     value             
##   <chr>    <list>            
## 1 berlin   <tibble [49 × 18]>
## 2 toronto  <tibble [49 × 18]>
## 3 tel_aviv <tibble [49 × 17]>
## 4 zurich   <tibble [49 × 18]>

This is because lists are also vectors. In our case each list entry contains a tibble, which can be “nested” into each entry of column value.

Starting with the tibble dict we can see how dpylr::mutate() and map() can nicely work together to produce a somewhat similar result:

dict %>%
  select(city_code, weather_filename) %>%
  mutate(
    data = map(weather_filename, ~ readxl::read_excel(here(.)))
  )

## # A tibble: 4 x 3
##   city_code weather_filename           data              
##   <chr>     <chr>                      <list>            
## 1 berlin    data/weather/berlin.xlsx   <tibble [49 × 18]>
## 2 toronto   data/weather/toronto.xlsx  <tibble [49 × 18]>
## 3 tel_aviv  data/weather/tel_aviv.xlsx <tibble [49 × 17]>
## 4 zurich    data/weather/zurich.xlsx   <tibble [49 × 18]>

This works because R interprets columns of tibbles as vectors, which can be fed to map(). To simplify the map() call, we create an intermediate column:

dict %>%
  select(city_code, weather_filename) %>%
  mutate(path = here(weather_filename)) %>%
  mutate(data = map(path, readxl::read_excel))

## # A tibble: 4 x 4
##   city_code weather_filename      path                           data      
##   <chr>     <chr>                 <chr>                          <list>    
## 1 berlin    data/weather/berlin.… /home/travis/build/krlmlr/tid… <tibble […
## 2 toronto   data/weather/toronto… /home/travis/build/krlmlr/tid… <tibble […
## 3 tel_aviv  data/weather/tel_avi… /home/travis/build/krlmlr/tid… <tibble […
## 4 zurich    data/weather/zurich.… /home/travis/build/krlmlr/tid… <tibble […

Staying in “tibble-land” as long as possible helps retaining other important components of the data you are processing, so that you can keep using familiar data transformation tools.

dict_data <-
  dict %>%
  mutate(
    data = map(weather_filename, ~ readxl::read_excel(here(.))),
    rows = map_int(data, nrow),
  ) %>%
  select(-weather_filename)
dict_data

## # A tibble: 4 x 6
##   city_code name        lng   lat data                rows
##   <chr>     <chr>     <dbl> <dbl> <list>             <int>
## 1 berlin    Berlin    13.4   52.5 <tibble [49 × 18]>    49
## 2 toronto   Toronto  -79.4   43.7 <tibble [49 × 18]>    49
## 3 tel_aviv  Tel Aviv  34.8   32.1 <tibble [49 × 17]>    49
## 4 zurich    Zürich     8.54  47.4 <tibble [49 × 18]>    49

This pattern can also be used with the map2() family of functions:

dict_data_with_desc <-
  dict_data %>%
  mutate(
    desc = map2_chr(
      name, rows,
      ~ paste0(..2, " rows in data for ", ..1)
    )
  )

Because mutate() always appends to the end, the most recently added column can always be accessed with pull():

dict_data_with_desc %>%
  pull()

## [1] "49 rows in data for Berlin"   "49 rows in data for Toronto" 
## [3] "49 rows in data for Tel Aviv" "49 rows in data for Zürich"

More generally, pmap() supports functions with an arbitrary number of arguments:

dict_data %>%
  mutate(
    cols = map_int(data, ncol),
    desc = pmap_chr(
      list(name, rows, cols),
      ~ paste0(..2, " rows and ", ..3, " cols in data for ", ..1)
    )
  )

## # A tibble: 4 x 8
##   city_code name       lng   lat data        rows  cols desc               
##   <chr>     <chr>    <dbl> <dbl> <list>     <int> <int> <chr>              
## 1 berlin    Berlin   13.4   52.5 <tibble […    49    18 49 rows and 18 col…
## 2 toronto   Toronto -79.4   43.7 <tibble […    49    18 49 rows and 18 col…
## 3 tel_aviv  Tel Av…  34.8   32.1 <tibble […    49    17 49 rows and 17 col…
## 4 zurich    Zürich    8.54  47.4 <tibble […    49    18 49 rows and 18 col…

4.2.1 Exercises

The imap() family of functions iterates over a vector and its names:

input_data %>% 
  imap_chr(~ paste0(.y, ": ", nrow(.x), " rows"))

##              berlin             toronto            tel_aviv 
##   "berlin: 49 rows"  "toronto: 49 rows" "tel_aviv: 49 rows" 
##              zurich 
##   "zurich: 49 rows"

Implement the same functionality using map2() inside a mutate(), and enframe():

good_times %>%
  ___() %>%
  mutate(___ = map2()) %>%
  deframe()

4.3 Nesting and unnesting

Click here to show setup code.

library(tidyverse)
library(here)

dict <- readxl::read_excel(here("data/cities.xlsx"))

dict_data <-
  dict %>%
  mutate(data = map(weather_filename, ~ readxl::read_excel(here(.)))) %>%
  select(-weather_filename)

How to work with nested data?

We start with the tibble dict_data from section “Moving to tibble-land”, which includes the nested tibbles in its column data.

If we want to actually look at the data we can directly use tidyr::unnest() on the whole tibble, which by default acts on all list-columns. This expands our tibble by repeating the formerly unnested column entries as many times, as each nested tibble has rows:

dict_data %>%
  unnest()

## # A tibble: 196 x 22
##   city_code name    lng   lat time                summary icon 
##   <chr>     <chr> <dbl> <dbl> <dttm>              <chr>   <chr>
## 1 berlin    Berl…  13.4  52.5 2019-04-28 15:00:00 Mostly… part…
## 2 berlin    Berl…  13.4  52.5 2019-04-28 16:00:00 Mostly… part…
## 3 berlin    Berl…  13.4  52.5 2019-04-28 17:00:00 Mostly… part…
## # … with 193 more rows, and 15 more variables: precipIntensity <dbl>,
## #   precipProbability <dbl>, temperature <dbl>, apparentTemperature <dbl>,
## #   dewPoint <dbl>, humidity <dbl>, pressure <dbl>, windSpeed <dbl>,
## #   windGust <dbl>, windBearing <dbl>, cloudCover <dbl>, uvIndex <dbl>,
## #   visibility <dbl>, ozone <dbl>, precipType <chr>

This is very similar to bind_rows() of the data column.

dict_data %>%
  pull(data) %>%
  bind_rows()

## # A tibble: 196 x 18
##   time                summary icon  precipIntensity precipProbabili…
##   <dttm>              <chr>   <chr>           <dbl>            <dbl>
## 1 2019-04-28 15:00:00 Mostly… part…               0                0
## 2 2019-04-28 16:00:00 Mostly… part…               0                0
## 3 2019-04-28 17:00:00 Mostly… part…               0                0
## # … with 193 more rows, and 13 more variables: temperature <dbl>,
## #   apparentTemperature <dbl>, dewPoint <dbl>, humidity <dbl>,
## #   pressure <dbl>, windSpeed <dbl>, windGust <dbl>, windBearing <dbl>,
## #   cloudCover <dbl>, uvIndex <dbl>, visibility <dbl>, ozone <dbl>,
## #   precipType <chr>

check_columns_same <- function(x, y) {
  stopifnot(identical(colnames(x), colnames(y)))
}

bind_rows <- function(data_frames) {
  # Called for the side effect
  reduce(data_frames, check_columns_same)

  dplyr::bind_rows(data_frames)
}

try(
  dict_data %>%
    pull(data) %>%
    bind_rows()
)

## Error in fn(out, elt, ...) : 
##   identical(colnames(x), colnames(y)) is not TRUE

Data flattened in this way is useful if the parts can be combined naturally into a larger dataset. Iterating over columns in the nested view corresponds to grouped operations in the flat view.

dict_data %>%
  mutate(n = map_int(data, nrow)) %>%
  select(-data)

## # A tibble: 4 x 5
##   city_code name        lng   lat     n
##   <chr>     <chr>     <dbl> <dbl> <int>
## 1 berlin    Berlin    13.4   52.5    49
## 2 toronto   Toronto  -79.4   43.7    49
## 3 tel_aviv  Tel Aviv  34.8   32.1    49
## 4 zurich    Zürich     8.54  47.4    49

dict_data %>%
  unnest() %>%
  count(name)

## # A tibble: 4 x 2
##   name         n
##   <chr>    <int>
## 1 Berlin      49
## 2 Tel Aviv    49
## 3 Toronto     49
## 4 Zürich      49

Inversely, if you want to have a more condensed view of your data, you can nest again. By default, the function tidyr::nest() will nest all data. Therefore it is often useful to tell it, which columns to ignore:

dict_data %>%
  unnest() %>%
  nest(-city_code, -name, -lng, -lat)

## # A tibble: 4 x 5
##   city_code name        lng   lat data              
##   <chr>     <chr>     <dbl> <dbl> <list>            
## 1 berlin    Berlin    13.4   52.5 <tibble [49 × 18]>
## 2 toronto   Toronto  -79.4   43.7 <tibble [49 × 18]>
## 3 tel_aviv  Tel Aviv  34.8   32.1 <tibble [49 × 18]>
## 4 zurich    Zürich     8.54  47.4 <tibble [49 × 18]>

Using this, we structure our data in new, customized ways. For processing of daily data over all cities, we create a new column date:

dict_data %>%
  unnest() %>%
  mutate(date = as.Date(time)) %>%
  nest(-date)

## # A tibble: 3 x 2
##   date       data              
##   <date>     <list>            
## 1 2019-04-28 <tibble [36 × 22]>
## 2 2019-04-29 <tibble [96 × 22]>
## 3 2019-04-30 <tibble [64 × 22]>

4.3.1 Exercises

Implement the following code as a mapping over a nested tibble. Use a helper function:

iris %>% 
  group_by(Species) %>% 
  summarize_all(list(Mean = mean)) %>% 
  ungroup()

## # A tibble: 3 x 5
 ##   Species Sepal.Length_Me… Sepal.Width_Mean Petal.Length_Me…
 ##   <fct>              <dbl>            <dbl>            <dbl>
 ## 1 setosa              5.01             3.43             1.46
 ## 2 versic…             5.94             2.77             4.26
 ## 3 virgin…             6.59             2.97             5.55
 ## # … with 1 more variable: Petal.Width_Mean <dbl>

summarize_to_mean <- function(data) {
  data %>% 
    ___(_____)
}

iris %>%
  nest(___) %>% 
  mutate(data = map(___, summarize_to_mean)) %>%
  unnest()

When is a grouped operation preferable over nesting? Discuss.

Data frames are lists under the hood. Explain the output of the following code. What use cases can you imagine?

dict_data %>% 
  as.list() %>% 
  enframe()

## # A tibble: 5 x 2
 ##   name      value     
 ##   <chr>     <list>    
 ## 1 city_code <chr [4]> 
 ## 2 name      <chr [4]> 
 ## 3 lng       <dbl [4]> 
 ## 4 lat       <dbl [4]> 
 ## 5 data      <list [4]>