The timings for the fst package looked very promising, and I decided to give it a try. When I found out that date-time values are stored, I was sold: I decided to use fst instead of SQLite as storage for medium-sized intermediate data. At the same time, I wanted to keep using dplyr verbs to access the data.
A dplyr interface is scheduled for fst 0.9.0, but I didn’t want to wait that long. I decided to implement the bare minimum necessary to grab columns of my choice from a .fst file, and perhaps allow easy access to all .fst files in a directory. The result is version 0.1-1 of the fstplyr package.
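As a point of reference, here is a minimal sketch of what grabbing selected columns from a .fst file looks like with the fst package alone, without any dplyr layer. The data frame, its column names, and the temporary file are made up for illustration.

```r
library(fst)

# Illustrative data with a date-time column (all names are assumptions).
path <- tempfile(fileext = ".fst")
df <- data.frame(
  id = 1:3,
  when = as.POSIXct(
    c("2017-01-01 12:00:00", "2017-01-02 12:00:00", "2017-01-03 12:00:00"),
    tz = "UTC"
  ),
  value = c(1.5, 2.5, 3.5)
)
write_fst(df, path)

# Column names are available from the metadata without reading the data.
col_names <- metadata_fst(path)$columnNames

# Read only the columns of interest; the date-time class is preserved.
some_cols <- read_fst(path, columns = c("id", "when"))
```

The `columns` argument is what makes column selection cheap: only the requested columns are read from disk.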
I had to implement:
src_fst(), which accepts a path to an existing directory. (This is the only exported function, apart from the functions reexported from dplyr.) This function calls dplyr::src(subclass = "fst") (which really should be named dplyr::new_src()) and already collects the metadata of all .fst files. The function returns an object of class "src_fst".
src_tbls.src_fst(), which enumerates the tables in a source.
format.src_fst() for nice formatting, with code borrowed from dbplyr.
tbl.src_fst() for opening a table. This method calls dplyr::make_tbl(subclass = "fst") (which again should be named dplyr::new_tbl()). The function returns an object of class "tbl_fst", which contains the previously fetched metadata.
dim.tbl_fst() and later dimnames.tbl_fst() for displaying the data.
rename.tbl_fst(), a method that returns a "tbl_fst" object with a modified column list, computed with the help of the tidyselect package.
All other dplyr methods for which a data.frame method exists, determined with
grep 'S3method.*[^_],data.frame' dplyr/NAMESPACE
All these methods call collect() on the input and then the data.frame method. This is neither foolproof nor fast, but it should give a working implementation for most methods with very little effort.
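The pieces in the list above can be sketched roughly as follows. This is not the fstplyr source: the `demo` subclass, `src_demo()`, and the field names `files`, `meta`, and `path` are hypothetical, chosen so that nothing here claims to mirror the package internals. Only `arrange()` is given the collect-then-delegate fallback; the other verbs would follow the same pattern.

```r
library(dplyr)
library(fst)

# Hypothetical source constructor: scan a directory and collect the
# metadata of every .fst file up front.
src_demo <- function(path) {
  files <- list.files(path, pattern = "\\.fst$", full.names = TRUE)
  names(files) <- tools::file_path_sans_ext(basename(files))
  meta <- lapply(files, metadata_fst)
  src(subclass = "demo", path = path, files = files, meta = meta)
}

# Enumerate the tables in the source.
src_tbls.src_demo <- function(x, ...) names(x$files)

# Open a table; the object carries the previously fetched metadata.
tbl.src_demo <- function(src, from, ...) {
  make_tbl(subclass = "demo", path = src$files[[from]], meta = src$meta[[from]])
}

# Dimensions come from the metadata, so no data is read.
dim.tbl_demo <- function(x) c(x$meta$nrOfRows, length(x$meta$columnNames))

# Materialize the table as a tibble.
collect.tbl_demo <- function(x, ...) as_tibble(read_fst(x$path))

# Fallback pattern: collect, then hand over to the data frame method.
arrange.tbl_demo <- function(.data, ...) arrange(collect(.data), ...)

# Usage with a throwaway directory and made-up data.
demo_dir <- tempfile("fst-demo-")
dir.create(demo_dir)
write_fst(
  data.frame(id = c(2L, 3L, 1L), value = c(20, 30, 10)),
  file.path(demo_dir, "scores.fst")
)

s <- src_demo(demo_dir)
tables <- src_tbls(s)
scores <- tbl(s, "scores")
scores_dim <- dim(scores)
sorted <- arrange(scores, id)
```

The fallback reads the whole file before sorting, which is why this approach is convenient but not fast.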
Some methods had to be reexported from dplyr to get rid of an R CMD check warning (thanks to Davis Vaughan for the hint):
checking S3 generic/method consistency ... WARNING
Warning: declared S3 method 'filter.tbl_fst' not found
Warning: declared S3 method 'intersect.tbl_fst' not found
Warning: declared S3 method 'setdiff.tbl_fst' not found
Warning: declared S3 method 'setequal.tbl_fst' not found
Warning: declared S3 method 'union.tbl_fst' not found
See section ‘Generic functions and methods’ in the ‘Writing R Extensions’ manual.

R CMD check results
0 errors | 1 warning | 0 notes
Reexporting these methods came with additional benefits: I now get warnings about argument name mismatches, and loading fstplyr alone is enough to work with the data. So I decided to reexport all dplyr methods.
Finally, I reexported the pipe operator %>%, just for completeness and to simplify the examples.
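For readers who have not reexported a function before, this is the usual roxygen2 idiom, shown here as a sketch for one dplyr generic and for the pipe. It is package source code meant to live in a file under R/, and it assumes the package is documented with roxygen2 and lists dplyr and magrittr under Imports; whether fstplyr does it exactly this way is not something the text above confirms.

```r
# Reexport a dplyr generic so that users get it by loading this package alone.
#' @importFrom dplyr filter
#' @export
dplyr::filter

# Reexport the pipe operator in the same way.
#' @importFrom magrittr %>%
#' @export
magrittr::`%>%`
```

After running roxygen2, each reexported object appears as an `export()` entry in NAMESPACE.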
So far, only rename() is more efficient than working on data frames. Implementing versions of the other verbs that operate directly on the .fst file is now only a matter of diligence. In particular, filter() should be fairly easy to implement, perhaps with the help of the bindr package.
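To give an idea of what a filter that works on the file could look like, here is a rough sketch under simplifying assumptions. It does not use bindr, it handles a single predicate column only, and it assumes a non-empty file. `filter_fst_demo()` and its arguments are hypothetical and are not part of fst or fstplyr. It relies only on the `columns`, `from`, and `to` arguments of `fst::read_fst()`.

```r
library(fst)

# Hypothetical helper: filter rows of a .fst file on one column.
filter_fst_demo <- function(path, predicate_col, keep_fun, columns = NULL) {
  # Step 1: read only the column needed to evaluate the predicate.
  pred <- read_fst(path, columns = predicate_col)[[predicate_col]]
  rows <- which(keep_fun(pred))

  if (length(rows) == 0L) {
    # No match: return a zero-row data frame with the requested columns.
    first_row <- read_fst(path, columns = columns, from = 1, to = 1)
    return(first_row[0, , drop = FALSE])
  }

  # Step 2: read the smallest contiguous row range that covers all matches,
  # then drop the non-matching rows in memory.
  chunk <- read_fst(path, columns = columns, from = min(rows), to = max(rows))
  chunk[rows - min(rows) + 1L, , drop = FALSE]
}

# Usage with made-up data.
path <- tempfile(fileext = ".fst")
write_fst(
  data.frame(
    id = 1:6,
    value = c(10, 20, 30, 40, 50, 60),
    label = letters[1:6],
    stringsAsFactors = FALSE
  ),
  path
)

big <- filter_fst_demo(path, "value", function(v) v > 30, columns = c("id", "label"))
even <- filter_fst_demo(path, "value", function(v) v %% 20 == 0)
none <- filter_fst_demo(path, "value", function(v) v > 100)
```

When the matching rows are clustered, this reads far less than the whole file; when they are scattered from the first row to the last, it reads every row of the requested columns, so it is at best a starting point.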