4 Import

Ingesting data.

This chapter discusses data import with RStudio, with the help of the readr, readxl, and rio packages.

4.1 Import single files

Click here to show setup code.

library(tidyverse)
library(readr)

The RStudio IDE offers a convenient way to import files in various common formats, including CSV and Excel. The “File / Import Dataset / From …” menus provide access to import assistants that:

  1. open a file for preview,
  2. allow tweaking import options,
  3. generate R code that you can copy-paste into your scripts for further reuse.

The assistant is run once for each dataset, from then only the generated code is required to import the data in a consistent way.

This is an example of auto-generated code for importing a dataset from the data/ directory.

## Parsed with column specification:
## cols(
##   col1 = [32mcol_double()[39m,
##   col2 = [31mcol_character()[39m,
##   col3 = [31mcol_character()[39m
## )

After importing, use view() to display the ingested dataset.

view(example1)
## # A tibble: 2 x 3
##    col1 col2  col3 
##   <dbl> <chr> <chr>
## 1   1   a     X    
## 2   2.5 b     Y

4.2 Import many files

Click here to show setup code.

library(tidyverse)
library(nycflights13)

library(here)

library(conflicted)
conflict_prefer("filter", "dplyr")
## [conflicted] Removing existing preference
## [conflicted] Will prefer [34mdplyr::filter[39m over any other package

Occasionally, a dataset is split across many files with a very similar format. The data/ directory contains several Excel files with the .xlsx extension with tables of nearly identical format.

files <- dir(path = here("data"), pattern = "[.]xlsx$", full.names = TRUE)
files
## [1] "/home/travis/build/krlmlr/vistransrep/book/data/example6a.xlsx"
## [2] "/home/travis/build/krlmlr/vistransrep/book/data/example6b.xlsx"
## [3] "/home/travis/build/krlmlr/vistransrep/book/data/example6c.xlsx"

An easy way to import all files at once is the rio::import_list() function from the {rio} package.

files %>%
  rio::import_list(setclass = class(tibble()), rbind = TRUE)
## # A tibble: 6 x 5
##      id  col1 col2  col3  `_file`                             
##   <dbl> <dbl> <chr> <chr> <chr>                               
## 1     1   1   a     X     /home/travis/build/krlmlr/vistransr…
## 2     1   2.5 b     Y     /home/travis/build/krlmlr/vistransr…
## 3     2   1.5 c     Z     /home/travis/build/krlmlr/vistransr…
## 4     2   2   d     W     /home/travis/build/krlmlr/vistransr…
## 5     3   4   g     J     /home/travis/build/krlmlr/vistransr…
## 6     3   3.5 f     H     /home/travis/build/krlmlr/vistransr…

If some files need manipulation before the data can be bound together, {rio} also offers a way to import them as a “named list”.

## $example6a
## # A tibble: 2 x 4
##      id  col1 col2  col3 
##   <dbl> <dbl> <chr> <chr>
## 1     1   1   a     X    
## 2     1   2.5 b     Y    
## 
## $example6b
## # A tibble: 2 x 4
##      id  col1 col2  col3 
##   <dbl> <dbl> <chr> <chr>
## 1     2   1.5 c     Z    
## 2     2   2   d     W    
## 
## $example6c
## # A tibble: 2 x 4
##      id  col1 col2  col3 
##   <dbl> <dbl> <chr> <chr>
## 1     3   4   g     J    
## 2     3   3.5 f     H

The data can be accessed individually for each input file.

## # A tibble: 2 x 4
##      id  col1 col2  col3 
##   <dbl> <dbl> <chr> <chr>
## 1     2   1.5 c     Z    
## 2     2   2   d     W

If a tweak is necessary, the data can be overwritten as needed.

## Error in function_list[[i]](value) : '...' used in an incorrect context

The bind_rows() function combines these components into a single dataset again.

all_tables <- bind_rows(list_of_tables, .id = "path")
all_tables
## # A tibble: 6 x 5
##   path         id  col1 col2  col3 
##   <chr>     <dbl> <dbl> <chr> <chr>
## 1 example6a     1   1   a     X    
## 2 example6a     1   2.5 b     Y    
## 3 example6b     2   1.5 c     Z    
## 4 example6b     2   2   d     W    
## 5 example6c     3   4   g     J    
## 6 example6c     3   3.5 f     H

When done, use filter() to access a single dataset.

all_tables %>%
  filter(path == "example6b") %>%
  summarize(mean(col1), first(col2))
## # A tibble: 1 x 2
##   `mean(col1)` `first(col2)`
##          <dbl> <chr>        
## 1         1.75 c

For performing an analysis across the entire dataset, per input file, use group_by():

## # A tibble: 3 x 3
##   path      `mean(col1)` `first(col2)`
##   <chr>            <dbl> <chr>        
## 1 example6a         1.75 a            
## 2 example6b         1.75 c            
## 3 example6c         3.75 g

Finally, map_dfr() offers a way to import files with more control. The details are out of scope here.

## # A tibble: 6 x 4
##      id  col1 col2  col3 
##   <dbl> <dbl> <chr> <chr>
## 1     1   1   a     X    
## 2     1   2.5 b     Y    
## 3     2   1.5 c     Z    
## 4     2   2   d     W    
## 5     3   4   g     J    
## 6     3   3.5 f     H