class: center, middle, inverse, title-slide # Data visualisation, reporting, and processing with R ## Supporting slides ### Kirill Müller, cynkra GmbH --- class: center # Survey results ## Purpose of R: Interactive work, automation, part of a toolchain ## Some previous experience with R ## Excel, databases and SPSS/SAS ## Varying programming experience, e.g. C/C++, VB/VBA, Python, JS ## Experience with shell/command line --- background-image: url(images/learn-r.png) background-size: 80% --- background-image: url(images/data-science-explore.png) background-size: 75% class: bottom #### Source: Grolemund and Wickham, R for data science --- # Other important RStudio shortcuts - **Focus source/console**: Ctrl + 1 / Ctrl + 2 - **Filter command history**: Start typing, then Ctrl + Cursor up - **Search command history**: Ctrl + R, then type - **Source with echo**: Ctrl + Shift + Enter - **Move lines up/down**: Alt + Cursor up/down - **Indent/outdent**: Tab / Shift + Tab - **Find in all files**: Ctrl + Shift + F --- # Tidy data > “Tidy datasets are all alike, but every messy dataset is messy in its own way.” –– Hadley Wickham ## Definition 1. Each variable must have its own column. 2. Each observation must have its own row. 3. Each value must have its own cell. ![](https://r4ds.had.co.nz/images/tidy-1.png) Source: Grolemund and Wickham, R for data science ??? Source: R4DS --- # Aesthetics - `x`, `y` - `shape` - `size` - `alpha` - `text` - `color` vs. `fill` - `group` Placed inside `aes()`: Map to variable, show legend Placed outside `aes()`: Change for all points, no legend --- # Graphing template ```r ggplot( data = .DATA., aes(x = ..., y = ..., ...) ) + .GEOM.( data = .DATA., aes(x = ..., y = ..., ...), .AES. = .CONST., position = .POSITION. ) + .STAT.(...) + .FACET.(...) ``` --- # Data transformation ## One table - `filter()` - `select()` / `rename()` - `arrange()` - `mutate()` / `transmute()` - `summarise()` ## Grouped operations - `group_by()` ## Joins - `xxx_join()` --- # Filter criteria - Operators: `==`, `!=`, `<`, `>`, `<=`, `>=` ```r month == 3 # careful: two = month >= 10 carrier != "UA" # careful: <> doesn't work arr_time < dep_time ``` - `near()` ```r near(sin(pi), 0) ``` - `between()`, `%in%` ```r between(hour, 8, 12) month %in% c(12, 1, 2) ``` --- # Filter criteria for strings - Operators: `==`, `!=`, `<`, `>`, `<=`, `>=` - Searching for pattern: ```r library(stringr) str_detect(tailnum, "^[^N]") str_detect(carrier, fixed("A")) ``` --- # Combining filter criteria - Operators: `&`, `|`, `!` ```r (month == 5) & between(day, 17, 18) (month == 3) | (month == 4) !between(month, 3, 6) ``` - Missing values ```r is.na(arr_time) is.na(NA + 3) is.na(!NA) is.na(0) ``` --- # Selection helpers ## By name - `. %>% select(var1, var2)` - `. %>% select(var1, everything())` - `. %>% select(ends_with("delay"))` ## Range - `. %>% select(var1:var2)` - `. %>% select(-var1:-var2)` ## By position - `. %>% select(1:5)` --- # Sorting data - `NA` sorts last - Use `desc()` to reverse sorting order --- # Mutation functions - Arithmetic: `+`, `-`, `*`, `/`, `^`, `%%`, `%/%` ```r dep_delay - arr_delay dep_time %/% 100 dep_time %% 100 dep_delay - mean(dep_delay) # See next slide ``` - Real functions, see `?base::Math` and `?dplyr::lead`: - Rounding: `floor()`, `ceiling()`, `round()` - Sign: `abs()`, `sign()` - Transform: `sqrt()`, `log()`, `log2()`, `exp()` - Trigonometric: `sin()` etc. - Cumulative: `cumsum()` etc. - Lead and lag: `lead()`, `lag()` - Recoding: `if_else()`, `case_when()`, `recode()` - All filtering functions for a new `logical` column - Ranking: `row_number()`, `min_rank()`, `ntile()` --- # Mutation function for strings - Replacing by pattern: ```r library(stringr) str_replace(origin, "GA", "XX") ``` --- # Aggregation functions ## Statistics - `sum()`, `prod()` - `na.rm = TRUE` - `mean()`, `median()` - `sd()`, `IQR()`, `mad()` - `min()`, `quantile(0.75)`, `max()` - `sum()` and `mean()` for `logical` variables: ```r mean(is.na(arr_time)) ``` ## Ranking - `n()` - `first()`, `last()`, `nth()` --- # Graphing template, with transformation ```r .DATA. %>% ... %>% ggplot( aes(x = ..., y = ..., ...) ) + .GEOM.( aes(x = ..., y = ..., ...), .AES. = .CONST., position = .POSITION. ) + .STAT.(...) + .FACET.(...) + .SCALE.(...) + .COORD.(...) + .THEME.(...) ``` --- background-image: url(images/left-join.gif) --- class: inverse --- background-image: url(images/inner-join.gif) --- class: inverse --- background-image: url(images/full-join.gif) --- class: inverse --- background-image: url(images/right-join.gif) --- class: inverse --- background-image: url(images/semi-join.gif) --- class: inverse --- background-image: url(images/anti-join.gif) --- # Joins - For each combination of join variables in the left data frame, find corresponding rows in the right data frame - Default: Join by matching variable names .pull-left[ ## Mutating join - Always returns rows from left *and* right data frame - Difference: behavior on mismatch - `inner_join()`: Keep only matching rows - `left_join()`: Keep all rows from left - `right_join()`: Keep all rows from right - `full_join()`: Keep all rows ] .pull-right[ ## Filtering join - Only return rows from left data frame - Difference: Returned set - `semi_join()`: Keep matching rows - `anti_join()`: Remove matching rows ] --- background-image: url(images/tidyr-spread-gather.gif) --- # Spread and gather .pull-left[ ## spread: long to wide - Take new column names from key column - Distribute values across new column names ] .pull-right[ ## gather: wide to long - Create new key column with column names - Fill existing data into new column ]