class: center, middle, inverse, title-slide

# tidyverse, DBI, and other adventures

### Kirill Müller

### 2017-09-11

---
background-image: url(images/giants.jpg)
background-size: cover

???

- who's familiar with DBI?
- with tidyverse?
- who can explain unquote-splice?
- who has used R?
- implemented a project entirely in R?
- developed an R package?
- Windows, OS X, Linux?
- standing on the shoulders of giants
- impossible without R and its ecosystem
- a few slides about earlier projects
- DBI, tidyverse contributions, and more recent projects
- outlook

---
background-image: url(images/hpml.png)
background-size: 140%

???

- computer science background
- fast shortest paths, ~20µs per query between any two nodes in Europe
- long preprocessing times

---
class: center, middle

<iframe allowfullscreen="" frameborder="0" height="425" mozallowfullscreen="" src="http://player.vimeo.com/video/57069805" webkitallowfullscreen="" width="750"></iframe>

[(link)](https://player.vimeo.com/video/57069805)

.left[
.pull-left[
![IVT](images/ivt.png)
]
]

.right[
.pull-right[
![ETH](images/eth.png)
]
]

???

- joined the IVT
- MATSim -- transport simulation
- sure they must have some shortest paths to compute
- project -> thesis topic
- first exposure to "real" data and "real" problems

---
background-image: url(images/here_to_help.png)
class: bottom, right

### Source: xkcd

???

- starting the thesis
- chose R over Python

---

# Why R?

```r
model_frame <- data.frame(x = 1:10, y = 3:-6 + rnorm(10, 0.2))
*lm(y ~ x, model_frame)
```

```
##
## Call:
## lm(formula = y ~ x, data = model_frame)
##
## Coefficients:
## (Intercept)            x
##       4.095       -1.024
```

???

- When I saw this, I was essentially sold

---
class: bottom, right
background-image: url(images/sac.jpg)
background-size: 102%

### Source: Safari Books Online

???
Explain choices:

- prefer SAC over data manipulation in loops
- prefer ggplot2 to base plotting from day one
- flashpoint: a tweak that was ridiculously difficult to achieve with base plotting
- prefer pure functions over state
- flashpoint: code stopped working when substituting data.frame with data.table
- talent to quickly advance to the limitations of any given system
- couldn't resist the temptation to fix things at the root
- contribution is very easy and pleasant with GitHub

---

# SwissCommunes

.center[
![mun](images/mun.jpg)
]

.right[
### Source: Avenir Suisse
]

---

# wrswoR

```r
sample_int_rank <- function(n, size, prob) {
  head(order(rexp(n) / prob), size)
}
```

.center[
![Median run times](index_files/figure-html/run-time-log-1.png)
]

???

- Reservoir sampling
- Surprisingly elegant solution to a seemingly difficult problem

---
class: middle

.pull-left[
![done](images/done.png)
]

.pull-right[
![r-consortium](images/rconsortium.png)

![rstudio](images/rstudio.png)
]

???

- at some point, thesis done
- starting involvement with RStudio and R Consortium
- thanks to patience of my advisor

---
background-image: url(images/deps.svg)
background-size: 100%

???

- DBI packages, explain

---
background-image: url(images/mole.gif)
background-size: 100%

???

- This is how it feels to develop an interface, a spec, and an implementation

---
background-image: url(images/spec.png)
background-size: 130%

---
background-image: url(images/spec-code.png)
background-size: 100%
class: inverse

---
background-image: url(images/roxypatch.png)
background-size: 130%

???

- example of a patch where *simplifying* code solves the problem

---
background-image: url(images/brushthat.gif)
background-size: 55%
background-position: 70% 0%

# brushthat

???
- task: consistent formatting of error messages in dplyr
- problem: rerunning tests, and finding the source of the error
- first Shiny experience
- thanks to Dean Attali for helping with a few tricky problems, and for advice which hasn't found its way into the code yet

---
background-image: url(images/pave-track.gif)
background-size: 160%

???

- paving your own road, building your own tools

---
background-image: url(images/tidyverse.png)
background-size: 102%

???

- each package has its own application
- why redo work? everything has been available before!
- *consistency!*
- simple API, each function does one thing (but does it well)
- composition
- pure functions *or* change state
- verb-based, data always the first argument
- works best with pipe (except ggplot2)

---
background-image: url(images/pipe.jpg)
background-size: 105%
class: inverse, middle, center

# <p style="font-size:300px"><code>%>%</code></p>

???

- interact: who's been using the pipe?

---
class: inverse, middle, center

# Select all United Airlines flights with a scheduled departure time before 6:00 AM that arrived after 10:00 AM and had an arrival delay of more than two hours, originating in one of New York City's airports.

---
class: middle

1. `carrier == "UA"`
2. `sched_dep_time < 600`
3. `arr_time > 1000`
4. `arr_delay > 120`
5. `origin %in% c("EWR", "LGA", "JFK")`

---

```r
library(nycflights13)
my_flights <- flights[flights$carrier == "UA", ]
my_flights <- flights[flights$sched_dep_time < 600, ]
my_flights <- flights[flights$arr_time > 1000, ]
my_flights <- flights[flights$arr_delay > 120, ]
my_flights <- flights[flights$origin %in% c("EWR", "LGA", "JFK"), ]
my_flights
```

```
## # A tibble: 336,776 x 19
##     year month   day dep_t… sche… dep_… arr_… sche… arr_d… carr…
##    <int> <int> <int>  <int> <int> <dbl> <int> <int>  <dbl> <chr>
##  1  2013     1     1    517   515  2.00   830   819   11.0 UA
##  2  2013     1     1    533   529  4.00   850   830   20.0 UA
##  3  2013     1     1    542   540  2.00   923   850   33.0 AA
##  4  2013     1     1    544   545 -1.00  1004  1022  -18.0 B6
##  5  2013     1     1    554   600 -6.00   812   837  -25.0 DL
##  6  2013     1     1    554   558 -4.00   740   728   12.0 UA
##  7  2013     1     1    555   600 -5.00   913   854   19.0 B6
##  8  2013     1     1    557   600 -3.00   709   723  -14.0 EV
##  9  2013     1     1    557   600 -3.00   838   846 - 8.00 B6
## 10  2013     1     1    558   600 -2.00   753   745   8.00 AA
## # ... with 336,766 more rows, and 9 more variables:
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## #   time_hour <dttm>
```

---

```r
library(nycflights13)
my_flights <- my_flights[my_flights$carrier == "UA", ]
```

```
## Error in eval(expr, envir, enclos): object 'my_flights' not found
```

```r
my_flights <- my_flights[my_flights$sched_dep_time < 600, ]
my_flights <- my_flights[my_flights$arr_time > 1000, ]
my_flights <- my_flights[my_flights$arr_delay > 120, ]
my_flights <- my_flights[my_flights$origin %in% c("EWR", "LGA", "JFK"), ]
my_flights
```

---

```r
library(nycflights13)
my_flights <- flights
my_flights <- my_flights[my_flights$carrier == "UA", ]
my_flights <- my_flights[my_flights$sched_dep_time < 600, ]
my_flights <- my_flights[my_flights$arr_time > 1000, ]
my_flights <- my_flights[my_flights$arr_delay > 120, ]
my_flights <- my_flights[my_flights$origin %in% c("EWR", "LGA", "JFK"), ]
my_flights
```

```
## # A tibble: 3 x 19
##    year month   day dep_t… sched… dep_d… arr_… sche… arr_… carr…
##   <int> <int> <int>  <int>  <int>  <dbl> <int> <int> <dbl> <chr>
## 1  2013     1     2    833    558    155  1018   727   171 UA
## 2  2013     3    11    752    530    142  1114   827   167 UA
## 3  2013     7    29    749    559    110  1104   902   122 UA
## # ... with 9 more variables: flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
```

???

Problem: `my_flights` is repeated three times in each line

---

```r
library(nycflights13)
library(dplyr)
my_flights <- flights
my_flights <- filter(my_flights, carrier == "UA")
my_flights <- filter(my_flights, sched_dep_time < 600)
my_flights <- filter(my_flights, arr_time > 1000)
my_flights <- filter(my_flights, arr_delay > 120)
my_flights <- filter(my_flights, origin %in% c("EWR", "LGA", "JFK"))
my_flights
```

```
## # A tibble: 3 x 19
##    year month   day dep_t… sched… dep_d… arr_… sche… arr_… carr…
##   <int> <int> <int>  <int>  <int>  <dbl> <int> <int> <dbl> <chr>
## 1  2013     1     2    833    558    155  1018   727   171 UA
## 2  2013     3    11    752    530    142  1114   827   167 UA
## 3  2013     7    29    749    559    110  1104   902   122 UA
## # ... with 9 more variables: flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
```

---

```r
library(nycflights13)
library(dplyr)
my_flights <- flights %>%
  filter(carrier == "UA") %>%
  filter(sched_dep_time < 600) %>%
  filter(arr_time > 1000) %>%
  filter(arr_delay > 120) %>%
  filter(origin %in% c("EWR", "LGA", "JFK"))
my_flights
```

```
## # A tibble: 3 x 19
##    year month   day dep_t… sched… dep_d… arr_… sche… arr_… carr…
##   <int> <int> <int>  <int>  <int>  <dbl> <int> <int> <dbl> <chr>
## 1  2013     1     2    833    558    155  1018   727   171 UA
## 2  2013     3    11    752    530    142  1114   827   167 UA
## 3  2013     7    29    749    559    110  1104   902   122 UA
## # ... with 9 more variables: flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
```

???

Advantages:

1. Readability by avoiding repetition
2. Maintainability

Discussion: is everyone convinced to use the pipe?

---
background-image: url(images/data-science.png)
background-size: 100%
class: bottom, right

### Source: r4ds

---
class: inverse, middle, center

# Everything is a `data.frame`

???

Unlike the predecessor `plyr`

- (named) vectors
- lists
- matrices
- nested tibbles
- pure functions
- programmability without actually programming
- "In this framework, very few data transformation problems actually require programming"

---
class: inverse, middle, center

# Everything is a **`tibble`**

---

# tibble = sturdy data frame

1. `stringsAsFactors = FALSE`
2. Subsetting always returns a tibble
3. Better printing

???

1. Base R's default `stringsAsFactors = TRUE`: sensible choice back then, now questionable
2. To avoid surprises
3. Designed to fit one screen, colored output coming soon

---
background-image: url(images/pillar.jpg)
background-size: 85%

???

Achieved by a helper package

- lighter-weight dependency

---
background-image: url(images/colonnade.jpg)
background-size: 100%
background-position: 0% 0%

---
class: inverse, center, middle

# dplyr + DBI = dbplyr

---
background-image: url(images/styler.gif)
background-size: 100%
background-position: 0% 77%

# styler

???

- this year's GSoC project, just finished
- awesome work by Lorenz Walthert

---
class: inverse, center, middle

# If the first line of your #rstats script is

# setwd("C:\Users\jenny\path\only\I\have")

# I will come into your lab and SET YOUR COMPUTER ON FIRE.

---
background-image: url(images/here.jpg)
background-size: 105%
class: inverse

# here

???

- clearly one of my most important contributions

---
class: inverse, center, middle

# `remake::make()`

???

Outlook:

- Pipe-based DSL
- Meta-workflows
- CRAN release

---
class: inverse, center, middle

# datatools LLC

???
- sponsoring
- consulting
- teaching

---
background-image: url(https://avatars1.githubusercontent.com/u/1741643?v=3&s=460)
background-size: 20%
background-position: 0% 100%
class: center, middle

# Thanks!

Slides created via the R package [**xaringan**](https://github.com/yihui/xaringan).

# https://goo.gl/Tma2zm

### https://github.com/krlmlr, @krlmlr

???

---
class: inverse
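---

# Appendix: tibble vs. data frame

A minimal sketch of the first two points on the "tibble = sturdy data frame" slide, assuming the tibble package is installed. The column names `x` and `y` are made up for illustration, and `stringsAsFactors = TRUE` is spelled out to reproduce the base R default at the time of the talk.

```r
library(tibble)

# Same toy data, once as a base data frame and once as a tibble
# (x and y are hypothetical columns, not taken from the talk)
df <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = TRUE)
tbl <- tibble(x = 1:3, y = c("a", "b", "c"))

# 1. Strings: converted to a factor in df, kept as character in tbl
class(df$y)
class(tbl$y)

# 2. Single-column subsetting: df drops to a plain vector, tbl stays a tibble
class(df[, "x"])
is_tibble(tbl[, "x"])
```

The third point, printing, is easiest to see by printing `df` and `tbl` at the console.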