vignettes/implement.Rmd
implement.Rmd
The timings for the fst package looked very promising, and I decided to give it a try. When I found out that date-time values are stored, I was sold: I decided to use fst instead of SQLite as a storage for medium-sized intermediate data. At the same time I wanted to keep using dplyr verbs to access the data.
A dplyr interface is scheduled for fst 0.9.0, but I didn’t want to wait that long. I decided to implement the bare minimum that was necessary to grab columns of my choice from a .fst
file, and perhaps allow easy access to all .fst
files in a directory. The result is version 0.1-1 of the fstplyr package.
I had to implement:
A constructor, src_fst()
, that accepts a path
to an existing directory. (This is the only exported function, apart from the functions reexported from dplyr.) This function calls dplyr::src(subclass = "fst")
(which really should be named dplyr::new_src()
) and already collects the metadata of all .fst
files. The function returns an object of class "src_fst"
.
The method src_tbls.src_fst()
, which enumerates the tables in a source.
The method format.src_fst()
for nice formatting, borrowed code from dbplyr.
The method tbl.src_fst()
for opening a table, this method calls dplyr::make_tbl(subclass = "fst")
(which again should be named dplyr::new_tbl()
). The function returns an object of class "tbl_fst"
which contains the previously fetched metadata.
Methods head.tbl_fst()
, dim.tbl_fst()
and later dimnames.tbl_fst()
for displaying the data.
A collect.tbl_fst()
method for fetching the entire data frame. (The head()
and collect()
methods both call fst::read_fst()
.)
A select.tbl_fst()
and rename.tbl_fst()
method that return a "tbl_fst"
object with a modified column list, computed with the help of the tidyselect package.
All other dplyr methods for which a data.frame
method exists, determined with
grep S3method.*[^_],data.frame dplyr/NAMESPACE
All these methods call collect()
on the input and then the data.frame()
version. This isn’t foolproof nor fast, but should give a working implementation for most methods with very little effort.
Some methods had to be reexported from dplyr to get rid of a R CMD check
warning (thanks to Davis Vaughan for the hint):
checking S3 generic/method consistency ... WARNING
Warning: declared S3 method 'filter.tbl_fst' not found
Warning: declared S3 method 'intersect.tbl_fst' not found
Warning: declared S3 method 'setdiff.tbl_fst' not found
Warning: declared S3 method 'setequal.tbl_fst' not found
Warning: declared S3 method 'union.tbl_fst' not found
See section ‘Generic functions and methods’ in the ‘Writing R
Extensions’ manual.
R CMD check results
0 errors | 1 warning | 0 notes
Reexporting these methods came with the additional benefits of warning about argument name mismatch, and also allows working only with loading fstplyr, so I decided to reexport all dplyr methods.
Finally, I reexported the pipe operator %>%
, just for completeness and to simplify the examples.
So far, only head()
, select()
, and rename()
are more efficient than working on data frames. Implementing versions of the other verbs that operate directly on the .fst
file is now only a matter of diligence. In particular, slice()
and filter()
should be fairly easy to implement, the latter perhaps with the help of the bindr package.