class: center, middle, inverse, title-slide

# Data visualisation, reporting, and processing with R
## Closing remarks
### Kirill Müller, cynkra GmbH

---

# EDA

Understand your data

- Generate questions
- Search for answers
- Rinse and repeat

---

# Important questions

- Variation
- Covariation

???

Tabular data

- Distribution
- Typical values
- Unusual values, outliers
- Missing values

---

# Discrete variables

.pull-left[

```r
ggplot(data = mpg) +
  geom_bar(
    mapping = aes(x = class)
  )
```

![](outro_files/figure-html/outro-1-1.png)<!-- -->
]

.pull-right[

```r
mpg %>%
  count(class)
```

```
## # A tibble: 7 x 2
##   class          n
##   <chr>      <int>
## 1 2seater        5
## 2 compact       47
## 3 midsize       41
## 4 minivan       11
## 5 pickup        33
## 6 subcompact    35
## 7 suv           62
```
]

---

# Continuous variables

.pull-left[

```r
ggplot(data = mpg) +
  geom_histogram(
    mapping = aes(x = displ),
    binwidth = 0.05
  )
```

![](outro_files/figure-html/outro-3-1.png)<!-- -->
]

.pull-right[

```r
ggplot(data = mpg) +
  geom_density(
    mapping = aes(x = displ)
  )
```

![](outro_files/figure-html/outro-4-1.png)<!-- -->
]

---

# Categorical vs. continuous variables

```r
ggplot(data = mpg) +
  geom_density(
    mapping = aes(x = displ, y = ..scaled.., color = class)
  )
```

![](outro_files/figure-html/outro-5-1.png)<!-- -->

---

# Categorical vs. continuous variables

```r
ggplot(data = mpg) +
  geom_boxplot(
    mapping = aes(x = class, y = displ)
  )
```

![](outro_files/figure-html/outro-6-1.png)<!-- -->

---

# Categorical variables

.pull-left[

```r
ggplot(data = mpg) +
  geom_bin2d(
    mapping = aes(x = drv, y = class)
  )
```

![](outro_files/figure-html/outro-7-1.png)<!-- -->
]

.pull-right[

```r
mpg %>%
  count(drv, class)
```

```
## # A tibble: 12 x 3
##    drv   class          n
##    <chr> <chr>      <int>
##  1 4     compact       12
##  2 4     midsize        3
##  3 4     pickup        33
##  4 4     subcompact     4
##  5 4     suv           51
##  6 f     compact       35
##  7 f     midsize       38
##  8 f     minivan       11
##  9 f     subcompact    22
## 10 r     2seater        5
## 11 r     subcompact     9
## 12 r     suv           11
```
]

---

# Continuous variables

```r
ggplot(data = mpg) +
  geom_jitter(
    mapping = aes(x = hwy, y = cty),
    alpha = 0.3
  )
```

![](outro_files/figure-html/outro-9-1.png)<!-- -->

---

class: inverse

???

---

# More transformations

- Grouped `mutate()` and `filter()`, see r4ds 5.7.1
- Scoped functions, see `?scoped`
- `complete()` and `fill()`
- `extract()`

## Same syntax for working with databases

- *dbplyr* package
- Transformation operations are translated to SQL

---

# Joins

r4ds, chapter 13

```r
flights %>%
  select(year, month, day, carrier) %>%
  left_join(airlines)
```

```
## Joining, by = "carrier"
```

```
## # A tibble: 336,776 x 5
##     year month   day carrier name
##    <int> <int> <int> <chr>   <chr>
##  1  2013     1     1 UA      United Air Lines Inc.
##  2  2013     1     1 UA      United Air Lines Inc.
##  3  2013     1     1 AA      American Airlines Inc.
##  4  2013     1     1 B6      JetBlue Airways
##  5  2013     1     1 DL      Delta Air Lines Inc.
##  6  2013     1     1 UA      United Air Lines Inc.
##  7  2013     1     1 B6      JetBlue Airways
##  8  2013     1     1 EV      ExpressJet Airlines Inc.
##  9  2013     1     1 B6      JetBlue Airways
## 10  2013     1     1 AA      American Airlines Inc.
## # … with 336,766 more rows
```

---

# Nested data frames

r4ds, chapter 25

```r
flights %>%
  nest(-month) %>%
  arrange(month)
```

```
## Warning: All elements of `...` must be named.
## Did you want `data = c(year, day, dep_time, sched_dep_time, dep_delay, arr_time, sched_arr_time,
##     arr_delay, carrier, flight, tailnum, origin, dest, air_time,
##     distance, hour, minute, time_hour)`?
```

```
## # A tibble: 12 x 2
##    month            data
##    <int> <list<df[,18]>>
##  1     1   [27,004 × 18]
##  2     2   [24,951 × 18]
##  3     3   [28,834 × 18]
##  4     4   [28,330 × 18]
##  5     5   [28,796 × 18]
##  6     6   [28,243 × 18]
##  7     7   [29,425 × 18]
##  8     8   [29,327 × 18]
##  9     9   [27,574 × 18]
## 10    10   [28,889 × 18]
## 11    11   [27,268 × 18]
## 12    12   [28,135 × 18]
```

---

# Visualizations not covered

.pull-left[

## Position adjustments

jitter, dodge, stack, nudge

## Scales

labeling, color, range
]

.pull-right[

## Coordinate systems

flipping, aspect ratio, polar

## Theming

tweaking plots, standard appearance
]

---

# Extensions for ggplot2

- Useful `ggplot2` extensions, https://www.ggplot2-exts.org/gallery/
    - ggstance
    - ggrepel
    - gganimate
    - GGally

- Useful `ggplot2` themes
    - ggpubr
    - ggthemr
    - ggpomological

---

# Pointers

Working directory hell:

- https://www.tidyverse.org/articles/2017/12/workflow-vs-script/

Symbolic link to data directory:

- Linux and OS X: `file.symlink()`
- Windows: `Sys.junction()`

R markdown: http://rmarkdown.rstudio.com/gallery.html

---

# Pointers 2

Literature:

- Quick-R [http://www.statmethods.net](http://www.statmethods.net)
- Advanced R [http://adv-r.had.co.nz](http://adv-r.had.co.nz)
- Packages suggested by RStudio [https://github.com/rstudio/RStartHere](https://github.com/rstudio/RStartHere)
- moderndive: [https://moderndive.com/](https://moderndive.com/)
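
---

# Pointers: linking a data directory

The symbolic-link pointer from the "Pointers" slide could look roughly like this. It is a minimal sketch, not a recommended project layout: the directory names `shared-data` and `project-data` and the file `example.csv` are made up for illustration, and everything lives under `tempdir()` so that nothing outside the R session is touched.

```r
# Hypothetical locations, for illustration only:
# `data_dir` stands in for where the data really lives,
# `link` for the path the project code should use.
data_dir <- tempfile("shared-data")
link <- tempfile("project-data")

dir.create(data_dir)
writeLines("x,y\n1,2", file.path(data_dir, "example.csv"))

# Create the link with the platform-specific base R helper.
# Both functions take (from, to): `to` becomes the link pointing at `from`.
if (.Platform$OS.type == "windows") {
  ok <- Sys.junction(data_dir, link)
} else {
  ok <- file.symlink(data_dir, link)
}

# Read through the link as if the data were part of the project
dat <- read.csv(file.path(link, "example.csv"))
dat
```

Both helpers return `FALSE` instead of raising an error when the link cannot be created, for example when the target path already exists, so checking `ok` before reading is worthwhile.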