class: center, middle, inverse, title-slide # Make-like declarative workflows with <img src="images/R_logo_45.png" />
### Kirill Müller ### Zurich, 2018-03-05 --- background-image: url(images/the_difference.png) background-position: 20% 100% class: right # Reproducible ### vs. # Replicable ??? Thank you. First, I'd like to run a small survey to make sure we're on the same page. Who has already - programmed? - programmed in R? - written a function in R? - installed software? - compiled software? - used Make? - used remake? Great! This presentation focuses on the last two points; I hope it will be useful to you. I'd like to start with definitions. What is a "reproducible workflow"? In the context of research, we must differentiate between "reproducible" and "replicable". - **Reproducible**: The same data yields the same results - Replicable: Repeating the study yields similar conclusions Image credit: [xkcd](https://xkcd.com/242/) --- background-image: url(images/the_difference.png) background-size: 120% background-position: 280% 100% class: inverse, center, right # Why? ??? Why is running on other people's computers important? - coworkers, collaborators, peer review - validation, verification, replication - different computers for development and analysis - your future self --- background-image: url(images/you.png) background-position: 50% 70% background-size: 50% class: center, inverse ??? Remember that time when you noticed that crucial data error two days prior to submission? --- background-image: url(images/conflicting.png) background-position: 50% 50% background-size: contain class: center ??? There's a certain tension between making a data analysis reproducible, amendable, and fast. Achieving any two of the three goals is usually easy; achieving all three at once is hard.
From https://cloud.smartdraw.com/editor.aspx?templateId=bfa3f50a-8818-4119-bd10-997f2ee3f60c#depoId=8495191&credID=-21599641 --- background-image: url(dots/intro.png) background-size: contain background-position: 0% 90% # Other people's computers - hardware - software: OS version, R version and packages - virtualization - containerization - [`packrat`](https://rstudio.github.io/packrat/) - [MRAN](https://mran.microsoft.com/) - data: directory paths - open data - [`here`](https://github.com/krlmlr/here) - **workflow**: what to do, how, and when ??? Components of reproducibility - hardware - software: OS version, R version and packages - data: directory paths - **workflow**: what to run, how, and when Describe the process with a simple example: preparing a report that contains modeling results obtained from raw data. The analogy used here: cooking a ragout from a piece of raw meat. Vegetarians, please think of a substitute for the meat. --- background-image: url(images/manuals.png) background-position: 50% 50% background-size: contain class: center, inverse # Manuals ??? This is what xkcd has to say about manuals; I think this applies to workflow descriptions as well. - the simpler the description, the more likely it is to be followed successfully - open data: download/harvest from web - restricted data: operate directly on raw data - cleanup scripts, avoid "manual cleaning" - model estimation, analysis, ..., final report - ideally, a single script that runs everything - of course, you're doing this already - unstructured description? --- background-image: url(dots/linear.png) background-position: 50% 50% background-size: contain class: center, bottom # A recipe ??? - complete instructions to prepare the meal - works for humans - works for computers - **does not** work well for - modification - extension - learning --- background-image: url(dots/detailed-rank.png) background-size: contain class: center, bottom # Splitting the recipe ???
- describe as a set of transformations - each step has inputs, outputs, and uses a **transformation rule** - the inputs and outputs are called **targets** - vegetables? --- background-image: url(dots/vegetables.png) background-size: contain class: center, bottom # Adding vegetables ??? - Easy to extend or modify - Rules can be reused --- background-image: url(dots/from-raw.png) background-size: contain class: center, bottom # Putting it all together ??? - Outputs can be combined easily - Arbitrary complexity --- background-image: url(dots/full.png) background-position: 50% 50% background-size: contain class: center, bottom # Dependency graph ??? - where to get food (for a truly reproducible cooking experience) - arbitrary complexity if the dependency structure doesn't contain cycles - define process as a directed acyclic graph - rule-based tools - [GNU make](https://www.gnu.org/software/make/manual/make.html) - [snakemake](https://snakemake.readthedocs.io/en/stable/) - [remake](https://github.com/richfitz/remake#readme) by Rich FitzJohn - [drake](https://github.com/ropensci/drake#readme) by Will Landau --- background-image: url(dots/detailed.png) background-size: contain background-position: 0% 90% # `Makefile` ```Makefile all: ragout ragout: fried_meat ⇥ combine --with=vegetables fried_meat > ragout fried_meat: chopped_meat ⇥ fry --with=oil chopped_meat > fried_meat chopped_meat: raw_meat ⇥ chop raw_meat > chopped_meat ``` ??? - describe rules - **order** doesn't matter - redundancy --- background-image: url(dots/detailed.png) background-size: contain background-position: 0% 90% # `Makefile` with placeholders ```Makefile all: ragout ragout: fried_meat ⇥ combine --with=vegetables $< > $@ fried_meat: chopped_meat ⇥ fry --with=oil $< > $@ chopped_meat: raw_meat ⇥ chop $< > $@ ``` --- background-image: url(dots/detailed.png) background-size: contain background-position: 0% 90% # `Makefile` with ![](images/R_logo_45.png)? 
```Makefile all: ragout.rds ragout.rds: fried_meat.rds ⇥ R -q -e 'library(cooking); fried_meat <- readRDS("$<"); ragout <- combine(fried_meat, with = "vegetables"); saveRDS(ragout, "$@")' fried_meat.rds: ... ``` --- background-image: url(dots/detailed.png) background-size: contain background-position: 0% 90% # `remake.yml` ```yaml packages: - cooking targets: all: depends: ragout ragout: command: combine(fried_meat, with = I("vegetables")) fried_meat: command: fry(chopped_meat, with = I("oil")) chopped_meat: command: chop("raw_meat.csv") ``` --- background-image: url(dots/detailed.png) background-size: contain background-position: 0% 90% ```r library(drake) library(cooking) plan <- drake_plan( ragout = combine(fried_meat, with = "vegetables"), fried_meat = fry(chopped_meat, with = "oil"), chopped_meat = chop(file_in("raw_meat.csv")) ) plan ``` <code class="remark-code r"><div class="remark-code-line"><span style="color:#8a8a8a">#></span> <span style="color:#8a8a8a"># A tibble: 3 x 2</span></div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> target command </div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> <span style="font-style:italic"><span style="color:#8a8a8a"><chr></span></span> <span style="font-style:italic"><span style="color:#8a8a8a"><chr></span></span> </div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> <span style="color:#b2b2b2">1</span> ragout <span style="color:#8a8a8a">"</span>combine(fried_meat, with = \"vegetables\")<span style="color:#8a8a8a">"</span></div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> <span style="color:#b2b2b2">2</span> fried_meat <span style="color:#8a8a8a">"</span>fry(chopped_meat, with = \"oil\")<span style="color:#8a8a8a">"</span> </div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> <span style="color:#b2b2b2">3</span> chopped_meat <span style="color:#8a8a8a">"</span>chop(file_in(\"raw_meat.csv\"))<span
style="color:#8a8a8a">"</span></div></code> ??? - the plan is a data frame - simplest case: static plan - easy to create dynamic plans --- background-image: url(dots/detailed.png) background-size: contain background-position: 0% 90%
--- class: inverse, center, middle # `drake::make(plan)` --- ```r make(plan) ``` <code class="remark-code r"><div class="remark-code-line"><span style="color:#8a8a8a">#></span> <span style="color:#afd7d7">cache</span> .../.drake</div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> <span style="color:#afd7d7">connect</span> 3 imports: pre_chunk, colourise_chunk, plan</div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> <span style="color:#afd7d7">connect</span> 3 targets: ragout, fried_meat, chopped_meat</div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> <span style="color:#afd7d7">check</span> 4 items: file "raw_meat.csv", chop, combine, fry</div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> <span style="color:#afd7d7">check</span> 1 item: chopped_meat</div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> <span style="color:#00afff">target</span> chopped_meat</div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> <span style="color:#afd7d7">check</span> 1 item: fried_meat</div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> <span style="color:#00afff">target</span> fried_meat</div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> <span style="color:#afd7d7">check</span> 1 item: ragout</div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> <span style="color:#ff5fff">unload</span> 1 item: chopped_meat</div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> <span style="color:#00afff">target</span> ragout</div></code> ```r readd(ragout) ``` <code class="remark-code r"><div class="remark-code-line"><span style="color:#8a8a8a">#></span> <span style="color:#afd7d7">cache</span> .../.drake</div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> ragout, <span style="font-style:italic">made of</span></div> <div class="remark-code-line"><span 
style="color:#8a8a8a">#></span> fried meat, <span style="font-style:italic">made of</span></div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> chopped meat, <span style="font-style:italic">made of</span></div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> raw meat</div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> oil</div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> vegetables</div></code> ---
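The `cooking` package used in the plan is not shown on these slides. To follow along without it, a minimal sketch with stand-in functions may help. The functions below are hypothetical substitutes written for illustration, not the package's actual code, and the string `"raw meat"` stands in for the contents of `raw_meat.csv`.

```r
# Hypothetical stand-ins for cooking::chop(), fry() and combine().
# Each returns a list recording what was made and from which ingredients.
chop <- function(x) {
  list(what = "chopped meat", made_of = list(x))
}

fry <- function(x, with) {
  list(what = "fried meat", made_of = c(list(x), as.list(with)))
}

combine <- function(x, with) {
  list(what = "ragout", made_of = c(list(x), as.list(with)))
}

# Flatten the nested recipe into the raw ingredients it was made of.
ingredients <- function(x) {
  if (is.character(x)) return(x)
  unlist(lapply(x$made_of, ingredients))
}

# The same three steps as in the plan, run by hand in dependency order.
chopped_meat <- chop("raw meat")
fried_meat <- fry(chopped_meat, with = "oil")
ragout <- combine(fried_meat, with = "vegetables")

ingredients(ragout)
```

Running the steps by hand like this fixes their order in the script; the plan only states what each target is made from and leaves the ordering to the tool.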
--- # Cooking again? ```r make(plan) ``` <code class="remark-code r"><div class="remark-code-line"><span style="color:#8a8a8a">#></span> <span style="color:#afd7d7">cache</span> .../.drake</div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> <span style="color:#afd7d7">connect</span> 3 imports: pre_chunk, colourise_chunk, plan</div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> <span style="color:#afd7d7">connect</span> 3 targets: ragout, fried_meat, chopped_meat</div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> <span style="color:#afd7d7">check</span> 4 items: file "raw_meat.csv", chop, combine, fry</div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> <span style="color:#afd7d7">check</span> 1 item: chopped_meat</div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> <span style="color:#afd7d7">check</span> 1 item: fried_meat</div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> <span style="color:#afd7d7">check</span> 1 item: ragout</div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> <span style="color:#00afff">All targets are already up to date.</span></div></code> --- # Spice it up! ```r plan <- drake_plan( ragout = combine(fried_meat, with = "vegetables"), * fried_meat = fry(chopped_meat, with = c("oil", "pepper")), chopped_meat = chop(file_in("raw_meat.csv")) ) ``` ---
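After a change like the one above, a make-like tool only needs to redo the affected steps. A toy sketch of that rule: a target is rebuilt when its command text has changed or when something it depends on was rebuilt. The helper `mini_make()` and its cache are hypothetical and written in base R for illustration; drake's real bookkeeping is considerably more thorough.

```r
# Toy make-like builder, NOT drake's implementation.
# `plan` is a named list of quoted commands, listed in dependency order.
# `cache` is an environment that remembers each target's command and value.
mini_make <- function(plan, cache) {
  rebuilt <- character()
  for (target in names(plan)) {
    command <- plan[[target]]
    text <- paste(deparse(command), collapse = " ")
    # Dependencies are the other targets mentioned in the command.
    deps <- intersect(all.vars(command), names(plan))
    old <- cache[[target]]
    stale <- is.null(old) || !identical(old$text, text) || any(deps %in% rebuilt)
    if (stale) {
      # Evaluate the command with the cached values of its dependencies.
      values <- lapply(mget(deps, envir = cache), `[[`, "value")
      cache[[target]] <- list(text = text, value = eval(command, values))
      rebuilt <- c(rebuilt, target)
    }
  }
  rebuilt
}

plan <- list(
  chopped_meat = quote(paste("chopped", "raw meat")),
  fried_meat   = quote(paste(chopped_meat, "fried in", "oil")),
  ragout       = quote(paste(fried_meat, "with", "vegetables"))
)
cache <- new.env()

first <- mini_make(plan, cache)   # builds all three targets
second <- mini_make(plan, cache)  # nothing changed, nothing rebuilt

# Spice it up: only the command of fried_meat changes.
plan$fried_meat <- quote(paste(chopped_meat, "fried in", "oil and pepper"))
third <- mini_make(plan, cache)   # rebuilds fried_meat and ragout only
```

The toy compares command text only. It ignores changes to files and to the functions a command calls, both of which a real tool has to track as well.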
--- ```r make(plan) ``` <code class="remark-code r"><div class="remark-code-line"><span style="color:#8a8a8a">#></span> <span style="color:#afd7d7">cache</span> .../.drake</div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> <span style="color:#afd7d7">connect</span> 3 imports: pre_chunk, colourise_chunk, plan</div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> <span style="color:#afd7d7">connect</span> 3 targets: ragout, fried_meat, chopped_meat</div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> <span style="color:#afd7d7">check</span> 5 items: file "raw_meat.csv", c, chop, combine, fry</div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> <span style="color:#afd7d7">check</span> 1 item: chopped_meat</div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> <span style="color:#afd7d7">check</span> 1 item: fried_meat</div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> <span style="color:#ffaf00">load</span> 1 item: chopped_meat</div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> <span style="color:#00afff">target</span> fried_meat</div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> <span style="color:#afd7d7">check</span> 1 item: ragout</div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> <span style="color:#ff5fff">unload</span> 1 item: chopped_meat</div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> <span style="color:#00afff">target</span> ragout</div></code> ```r readd(ragout) ``` <code class="remark-code r"><div class="remark-code-line"><span style="color:#8a8a8a">#></span> <span style="color:#afd7d7">cache</span> .../.drake</div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> ragout, <span style="font-style:italic">made of</span></div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> fried meat, <span 
style="font-style:italic">made of</span></div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> chopped meat, <span style="font-style:italic">made of</span></div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> raw meat</div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> oil </div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> pepper</div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> vegetables</div></code> --- class: inverse, center, middle # Running your analysis on other people's computers enables reproducibility. # # Describe your analysis as a set of<br/>**data transformations**<br/>to make it easy to run <br/>for you and for others. --- --- class: center, middle, inverse, title-slide # Make-like declarative workflows with <img src="images/R_logo_45.png" /> ## Advanced topics ### Kirill Müller ### Zurich, 2018-03-05 --- # Project directory ```r library(here) ``` <code class="remark-code r"><div class="remark-code-line"><span style="color:#8a8a8a">#></span> here() starts at ...</div></code> ```r dir(here()) ``` <code class="remark-code r"><div class="remark-code-line"><span style="color:#8a8a8a">#></span> [1] "cooking-tutorial" "cooking.Rmd" </div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> [3] "docs" "dots" </div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> [5] "gsp" "images" </div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> [7] "index_cache" "index.Rmd" </div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> [9] "Makefile" "notes.md" </div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> [11] "outline.md" "packrat" </div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> [13] "raw_meat.csv" "remake-slides.Rproj"</div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> [15] "remake.yml"</div></code> ```r 
here("docs", "index.html") ``` <code class="remark-code r"><div class="remark-code-line"><span style="color:#8a8a8a">#></span> [1] ".../docs/index.html"</div></code> --- # Project directory ```r dr_here() ``` <code class="remark-code r"><div class="remark-code-line"><span style="color:#8a8a8a">#></span> here() starts at ..., because it contains a file matching `[.]Rproj$` with contents matching `^Version: ` in the first line</div></code> ## Files outside project directory? Use links! - URLs - Windows: `Sys.junction()` - OS X / Linux: `file.link()` - `fs::link_create()` --- # File organization ```r *drake_example("gsp") cat( system( * "tree gsp", intern = TRUE), sep = "\n" ) ``` <code class="remark-code r"><div class="remark-code-line"><span style="color:#8a8a8a">#></span> gsp</div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> ├── clean.R</div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> ├── interactive-tutorial.R</div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> ├── make.R</div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> ├── R</div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> │ ├── functions.R</div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> │ ├── packages.R</div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> │ └── plan.R</div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> ├── README.md</div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> └── report.Rmd</div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> </div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> 1 directory, 8 files</div></code> --- background-image: url(images/conflicting.png) background-position: 50% 50% background-size: contain # Scalability? 
--- # Parallel processing ```r unlink(".drake", recursive = TRUE) plan <- drake_plan( sleep_1 = Sys.sleep(1), sleep_2 = Sys.sleep(2), sleep_3 = Sys.sleep(3), sleep = list(sleep_1, sleep_2, sleep_3) ) system.time(make( plan, verbose = FALSE )) ``` <code class="remark-code r"><div class="remark-code-line"><span style="color:#8a8a8a">#></span> user system elapsed </div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> 0.163 0.047 6.216</div></code> --- # Parallel processing ```r unlink(".drake", recursive = TRUE) plan <- drake_plan( sleep_1 = Sys.sleep(1), sleep_2 = Sys.sleep(2), sleep_3 = Sys.sleep(3), sleep = list(sleep_1, sleep_2, sleep_3) ) system.time(make( plan, verbose = FALSE, * jobs = 4 )) ``` <code class="remark-code r"><div class="remark-code-line"><span style="color:#8a8a8a">#></span> user system elapsed </div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> 0.335 0.458 3.416</div></code> --- background-image: url(dots/detailed-validate.png) background-position: 50% 90% background-size: contain # Validation - Directly at the source - Usually can be run in parallel to further processing --- background-image: url(dots/detailed-parallel.png) background-position: 50% 90% background-size: contain # Meta-rules - `evaluate_plan()` et al. 
- tidy evaluation: `!!` - *dplyr* manipulation - *data.table* manipulation - base R manipulation ## Goal: Specify meta-rules in `drake_plan()` --- # Dynamic plans ```r plan <- fs::dir_ls(type = "file") %>% tibble::enframe() %>% dplyr::transmute( target = paste0("hash_", value), command = paste0("digest::digest(file_in('", value, "'))") ) plan ``` <code class="remark-code r"><div class="remark-code-line"><span style="color:#8a8a8a">#></span> <span style="color:#8a8a8a"># A tibble: 8 x 2</span></div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> target command </div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> <span style="font-style:italic"><span style="color:#8a8a8a"><chr></span></span> <span style="font-style:italic"><span style="color:#8a8a8a"><chr></span></span> </div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> <span style="color:#b2b2b2">1</span> hash_Makefile digest::digest(file_in('Makefile')) </div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> <span style="color:#b2b2b2">2</span> hash_cooking.Rmd digest::digest(file_in('cooking.Rmd'))</div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> <span style="color:#b2b2b2">3</span> hash_index.Rmd digest::digest(file_in('index.Rmd')) </div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> <span style="color:#b2b2b2">4</span> hash_notes.md digest::digest(file_in('notes.md')) </div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> <span style="color:#b2b2b2">5</span> hash_outline.md digest::digest(file_in('outline.md')) </div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> <span style="color:#b2b2b2">6</span> hash_raw_meat.csv digest::digest(file_in('raw_meat.csv'…</div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> <span style="color:#b2b2b2">7</span> hash_remake-slides.Rproj digest::digest(file_in('remake-slides…</div> <div 
class="remark-code-line"><span style="color:#8a8a8a">#></span> <span style="color:#b2b2b2">8</span> hash_remake.yml digest::digest(file_in('remake.yml'))</div></code> --- # Dynamic plans <code class="remark-code r"><div class="remark-code-line"><span style="color:#8a8a8a">#></span> Warning in repair_target_names(plan$target): replacing illegal</div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> symbols in target names with '_'.</div></code>
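The warning appears because file names such as `remake-slides.Rproj` contain characters that are not allowed in target names. One way to avoid it is to sanitize the names while building the plan. The sketch below uses base R only; the helper name `safe_target()` and its regular expression are assumptions for illustration and may not match drake's own repair rule exactly.

```r
# Hypothetical helper: keep letters, digits, '_' and '.', replace the rest.
safe_target <- function(path) {
  paste0("hash_", gsub("[^A-Za-z0-9_.]", "_", path))
}

files <- c("Makefile", "raw_meat.csv", "remake-slides.Rproj")

# A plan is a data frame with one row per target.
plan <- data.frame(
  target = safe_target(files),
  command = paste0("digest::digest(file_in('", files, "'))"),
  stringsAsFactors = FALSE
)

plan$target
```

Only the target name is altered; the command still refers to the original file name.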
--- # Random seeds ```r *set.seed(123) plan <- drake_plan( random = runif(5) ) make(plan, verbose = FALSE) readd(random, verbose = FALSE) ``` <code class="remark-code r"><div class="remark-code-line"><span style="color:#8a8a8a">#></span> [1] 0.5818798 0.9502068 0.4052817 0.8609804 0.6039763</div></code> ```r clean(random) ``` <code class="remark-code r"><div class="remark-code-line"><span style="color:#8a8a8a">#></span> <span style="color:#afd7d7">cache</span> .../.drake</div></code> ```r make(plan, verbose = FALSE) readd(random, verbose = FALSE) ``` <code class="remark-code r"><div class="remark-code-line"><span style="color:#8a8a8a">#></span> [1] 0.5818798 0.9502068 0.4052817 0.8609804 0.6039763</div></code> --- # Triggers ```r plan <- drake_plan( check = print("*** Checking!"), ) %>% dplyr::mutate(trigger = "always") plan ``` <code class="remark-code r"><div class="remark-code-line"><span style="color:#8a8a8a">#></span> <span style="color:#8a8a8a"># A tibble: 1 x 3</span></div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> target command trigger</div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> <span style="font-style:italic"><span style="color:#8a8a8a"><chr></span></span> <span style="font-style:italic"><span style="color:#8a8a8a"><chr></span></span> <span style="font-style:italic"><span style="color:#8a8a8a"><chr></span></span> </div> <div class="remark-code-line"><span style="color:#8a8a8a">#></span> <span style="color:#b2b2b2">1</span> check <span style="color:#8a8a8a">"</span>print(\"*** Checking!\")<span style="color:#8a8a8a">"</span> always</div></code> ```r make(plan, verbose = FALSE) ``` <code class="remark-code r"><div class="remark-code-line"><span style="color:#8a8a8a">#></span> [1] "*** Checking!"</div></code> ```r make(plan, verbose = FALSE) ``` <code class="remark-code r"><div class="remark-code-line"><span style="color:#8a8a8a">#></span> [1] "*** Checking!"</div></code>
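A trigger decides when a target is rebuilt. With `trigger = "always"` the command runs on every `make()`, which is why the message is printed twice above. The decision can be sketched as a toy function. `should_build()` and its arguments are hypothetical; only the `always` trigger is taken from the slide, and the other two branches are assumptions about how a make-like tool would typically behave.

```r
# Toy decision rule, NOT drake's implementation.
should_build <- function(trigger, in_cache, command_changed) {
  switch(trigger,
    always  = TRUE,                          # rebuild no matter what
    missing = !in_cache,                     # rebuild only if never built
    command = !in_cache || command_changed,  # rebuild if new or edited
    stop("unknown trigger: ", trigger)
  )
}

# An up-to-date target is rebuilt anyway under "always" ...
should_build("always", in_cache = TRUE, command_changed = FALSE)

# ... but skipped under a command-based rule.
should_build("command", in_cache = TRUE, command_changed = FALSE)
```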