Implementing a dplyr backend from scratch

The timings for the fst package looked very promising, so I decided to give it a try. When I found out that it also stores date-time values, I was sold: I decided to use fst instead of SQLite as storage for medium-sized intermediate data. At the same time, I wanted to keep using dplyr verbs to access the data.

A dplyr interface is scheduled for fst 0.9.0, but I didn’t want to wait that long. I decided to implement the bare minimum that was necessary to grab columns of my choice from a .fst file, and perhaps allow easy access to all .fst files in a directory. The result is version 0.1-1 of the fstplyr package.

Recipe

I had to implement:

A constructor, src_fst(), that accepts a path to an existing directory. (This is the only exported function, apart from the functions reexported from dplyr.) This function calls dplyr::src(subclass = "fst") (which really should be named dplyr::new_src()) and collects the metadata of all .fst files up front. The function returns an object of class "src_fst".
The method src_tbls.src_fst(), which enumerates the tables in a source.
The method format.src_fst() for nice formatting, with code borrowed from dbplyr.
The method tbl.src_fst() for opening a table. This method calls dplyr::make_tbl(subclass = "fst") (which again should be named dplyr::new_tbl()) and returns an object of class "tbl_fst" that contains the previously fetched metadata.
Methods head.tbl_fst(), dim.tbl_fst() and later dimnames.tbl_fst() for displaying the data.
A collect.tbl_fst() method for fetching the entire data frame. (The head() and collect() methods both call fst::read_fst().)
The methods select.tbl_fst() and rename.tbl_fst(), which return a "tbl_fst" object with a modified column list, computed with the help of the tidyselect package.
All other dplyr methods for which a data.frame method exists, determined with
```
grep 'S3method.*[^_],data\.frame' dplyr/NAMESPACE
```
All these methods call collect() on the input and then dispatch to the data.frame method. This is neither foolproof nor fast, but it should give a working implementation for most methods with very little effort.
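
The recipe above can be condensed into the following sketch. It is not the fstplyr source: the `_sketch` suffix on every function and class name and the field names `files`, `meta`, `path` and `columns` are made up for illustration, and filter() stands in for all the verbs that fall back to collect(). It assumes that dplyr, tidyselect and fst are installed.

```r
library(dplyr)

# Constructor: remember all .fst files in a directory, read their metadata once.
src_fst_sketch <- function(path) {
  stopifnot(dir.exists(path))
  files <- list.files(path, pattern = "\\.fst$", full.names = TRUE)
  names(files) <- tools::file_path_sans_ext(basename(files))
  meta <- lapply(files, fst::metadata_fst)
  # src() prepends "src_" to the subclass, giving class "src_fst_sketch"
  src(subclass = "fst_sketch", path = path, files = files, meta = meta)
}

# Enumerate the tables in the source (one table per file).
src_tbls.src_fst_sketch <- function(x, ...) names(x$files)

# Open a table: no data is read, only the path and the column names are kept.
# `columns` maps visible column names (names) to names in the file (values).
tbl.src_fst_sketch <- function(src, from, ...) {
  cols <- src$meta[[from]]$columnNames
  make_tbl("fst_sketch", path = src$files[[from]], columns = setNames(cols, cols))
}

# select() only edits the column list; the file is not touched.
select.tbl_fst_sketch <- function(.data, ...) {
  sel <- tidyselect::vars_select(names(.data$columns), ...)
  .data$columns <- setNames(.data$columns[sel], names(sel))
  .data
}

# collect() reads the selected columns and applies the visible names.
collect.tbl_fst_sketch <- function(x, ...) {
  file_cols <- unname(x$columns)
  out <- fst::read_fst(x$path, columns = file_cols)[file_cols]
  names(out) <- names(x$columns)
  as_tibble(out)
}

# Fallback pattern: fetch everything, then use the data frame method.
filter.tbl_fst_sketch <- function(.data, ...) {
  filter(collect(.data), ...)
}

# Usage with a temporary directory holding a single .fst file
fst_dir <- tempfile("fst_sketch_")
dir.create(fst_dir)
fst::write_fst(mtcars, file.path(fst_dir, "mtcars.fst"))

fst_src <- src_fst_sketch(fst_dir)
tables <- src_tbls(fst_src)
result <- tbl(fst_src, "mtcars") %>%
  select(mpg, cylinders = cyl) %>%
  filter(cylinders == 6)
```

In this sketch only two columns are read from disk, because select() runs before the fallback filter() triggers collect().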

Some methods had to be reexported from dplyr to get rid of an R CMD check warning (thanks to Davis Vaughan for the hint):

checking S3 generic/method consistency ... WARNING
Warning: declared S3 method 'filter.tbl_fst' not found
Warning: declared S3 method 'intersect.tbl_fst' not found
Warning: declared S3 method 'setdiff.tbl_fst' not found
Warning: declared S3 method 'setequal.tbl_fst' not found
Warning: declared S3 method 'union.tbl_fst' not found
See section ‘Generic functions and methods’ in the ‘Writing R
Extensions’ manual.
R CMD check results
0 errors | 1 warning  | 0 notes

Reexporting these methods had two additional benefits: the check now also warns about mismatched argument names, and users only need to load fstplyr to work with the data. I therefore decided to reexport all dplyr methods.

Finally, I reexported the pipe operator %>%, just for completeness and to simplify the examples.
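
For readers who have not done this before, a reexport is commonly declared with roxygen2 tags as in the fragment below. This is the usual pattern and only an assumption about how fstplyr does it; filter() is picked as one example, and the file name in the comment is hypothetical.

```r
# In a file of the package, e.g. R/reexports.R (hypothetical file name).
# roxygen2 turns each pair of tags into importFrom() and export() directives
# in NAMESPACE, so the object is available after library(fstplyr) alone.

#' @importFrom dplyr filter
#' @export
dplyr::filter

#' @importFrom dplyr %>%
#' @export
dplyr::`%>%`
```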

Result

So far, only head(), select(), and rename() are more efficient than working on data frames. Implementing versions of the other verbs that operate directly on the .fst file is now only a matter of diligence. In particular, slice() and filter() should be fairly easy to implement, the latter perhaps with the help of the bindr package.
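
As a rough idea of what a file-backed slice() could build on: fst::read_fst() accepts `from` and `to` arguments, so only the row range that covers the requested rows needs to be read. The helper below is a sketch under that assumption, not part of fstplyr; its name and arguments are invented here, and it handles only positive row numbers.

```r
# Hypothetical helper: read just the rows between the smallest and the
# largest requested row number, then pick the requested rows from that chunk.
slice_fst_sketch <- function(path, rows, columns = NULL) {
  rows <- as.integer(rows)
  stopifnot(length(rows) > 0, all(rows >= 1L))
  first <- min(rows)
  chunk <- fst::read_fst(path, columns = columns, from = first, to = max(rows))
  chunk[rows - first + 1L, , drop = FALSE]
}

# Usage with a temporary file
fst_path <- tempfile(fileext = ".fst")
fst::write_fst(mtcars, fst_path)
sliced <- slice_fst_sketch(fst_path, c(5, 7, 10), columns = c("mpg", "cyl"))
```

For widely scattered row numbers this still reads everything in between, so a real implementation might split the request into several ranges.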

Kirill Müller

2018-04-15
