class: center, middle, inverse, title-slide

# Data visualisation, reporting, and processing with R
## Supporting slides
### Kirill Müller

---

class: center

# Survey results

## Purpose of R: Interactive work, automation

## Some previous experience with R

## SPSS and Excel

## Varying programming experience, e.g. VB/VBA, Python

## Some experience with shell/command line

---

# Aesthetics

- `x`, `y`
- `shape`
- `size`
- `alpha`
- `text`
- `color` vs. `fill`
- `group`

Placed inside `aes()`: Map to variable, show legend

Placed outside `aes()`: Change for all points, no legend

---

# Graphing template

```r
ggplot(
  data = .DATA.,
  aes(x = ..., y = ..., ...)
) +
  .GEOM.(
    data = .DATA.,
    aes(x = ..., y = ..., ...),
    .AES. = .CONST.,
    position = .POSITION.
  ) +
  .STAT.(...) +
  .FACET.(...)
```

---

# Data transformation

## One table

- `filter()`
- `select()` / `rename()`
- `arrange()`
- `mutate()` / `transmute()`
- `summarise()`

## Grouped operations

## Joins

---

# Filter criteria

- Operators: `==`, `!=`, `<`, `>`, `<=`, `>=`

```r
month == 3           # careful: two =
month >= 10
carrier != "UA"      # careful: <> doesn't work
arr_time < dep_time
```

- `near()`

```r
near(sin(pi), 0)
```

- `between()`, `%in%`

```r
between(hour, 8, 12)
month %in% c(12, 1, 2)
```

---

# Combining filter criteria

- Operators: `&`, `|`, `!`

```r
(month == 5) & between(day, 17, 18)
(month == 3) | (month == 4)
!between(month, 3, 6)
```

- Missing values

```r
is.na(arr_time)
is.na(NA + 3)
is.na(!NA)
is.na(0)
```

---

# Selection helpers

## By name

- `. %>% select(var1, var2)`
- `. %>% select(var1, everything())`
- `. %>% select(ends_with("delay"))`

## Range

- `. %>% select(var1:var2)`
- `. %>% select(-var1:-var2)`

## By position

- `. %>% select(1:5)`

---

# Sorting data

- `NA` sorts last
- Use `desc()` to reverse sorting order

---

# Mutation functions

- Arithmetic: `+`, `-`, `*`, `/`, `^`, `%%`, `%/%`

```r
dep_delay - arr_delay
dep_time %/% 100
dep_time %% 100
dep_delay - mean(dep_delay)  # See next slide
```

- Real functions, see `?base::Math` and `?dplyr::lead`:
    - Rounding: `floor()`, `ceiling()`, `round()`
    - Sign: `abs()`, `sign()`
    - Transform: `sqrt()`, `log()`, `log2()`, `exp()`
    - Trigonometric: `sin()` etc.
    - Cumulative: `cumsum()` etc.
    - Lead and lag: `lead()`, `lag()`
- Recoding: `recode()`
- All filter criteria, which return `logical` values
- Ranking: `row_number()`, `min_rank()`, `ntile()`

---

# Aggregate functions

## Statistics

- `sum()`, `prod()`
- `na.rm = TRUE`
- `mean()`, `median()`
- `sd()`, `IQR()`, `mad()`
- `min()`, `quantile(x, 0.75)`, `max()`
- `sum()` and `mean()` for `logical` variables:

```r
mean(is.na(arr_time))
```

## Ranking

- `n()`
- `first()`, `last()`, `nth()`

---

# Graphing template, with transformation

```r
.DATA. %>%
  ... %>%
  ggplot(
    aes(x = ..., y = ..., ...)
  ) +
  .GEOM.(
    aes(x = ..., y = ..., ...),
    .AES. = .CONST.,
    position = .POSITION.
  ) +
  .STAT.(...) +
  .FACET.(...) +
  .SCALE.(...) +
  .COORD.(...) +
  .THEME.(...)
```

---

# Joins

- For each combination of join variables in the left data frame, find corresponding rows in the right data frame
- Default: Join by matching variable names

.pull-left[

## Mutating join

- Always returns columns from left *and* right data frame
- Difference: Behavior on mismatch
    - `inner_join()`: Keep only matching rows
    - `left_join()`: Keep all rows from left
    - `right_join()`: Keep all rows from right
    - `full_join()`: Keep all rows

]

.pull-right[

## Filtering join

- Only returns columns from left data frame
- Difference: Returned set
    - `semi_join()`: Keep matching rows
    - `anti_join()`: Remove matching rows

]
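
---

# Joins, illustrated

A minimal sketch of the join verbs listed on the previous slide, assuming the dplyr package is attached. The toy tables `flights_small` and `airlines_small` are hypothetical and were made up for illustration; they are not part of the course data.

```r
library(dplyr)

# Hypothetical toy data: "XX" has no match on the right,
# "DL" has no match on the left
flights_small <- tibble(
  flight  = c(1, 2, 3),
  carrier = c("UA", "AA", "XX")
)
airlines_small <- tibble(
  carrier = c("UA", "AA", "DL"),
  name    = c("United", "American", "Delta")
)

# Mutating joins: columns from both tables, differ on mismatch
inner <- inner_join(flights_small, airlines_small, by = "carrier")
left  <- left_join(flights_small, airlines_small, by = "carrier")
right <- right_join(flights_small, airlines_small, by = "carrier")
full  <- full_join(flights_small, airlines_small, by = "carrier")

# Filtering joins: only columns of the left table
semi <- semi_join(flights_small, airlines_small, by = "carrier")
anti <- anti_join(flights_small, airlines_small, by = "carrier")
```

Comparing `nrow()` and `names()` across the six results shows the difference: unmatched rows are filled with `NA` in the mutating joins, while the filtering joins never add the `name` column.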