Data visualisation, reporting, and processing with R

# Data visualisation, reporting, and processing with R
## Supporting slides
### Kirill Müller, cynkra GmbH

---

# Survey results

## Purpose of R: Automation, interactive work, toolsets

## Some previous experience with R

### Excel, SPSS/Stata/SAS/Access/databases

### Some programming experience, e.g. C/C++, VB/VBA, C, FORTRAN, Java

### Little experience with shell/command line and VCS

### Windows and MacOS

---

---

background-image: url(images/data-science-explore.png)
background-size: 75%
class: bottom

#### Source: Grolemund and Wickham, R for data science

---

# Other important RStudio shortcuts

- **Focus source/console**: Ctrl + 1 / Ctrl + 2
- **Filter command history**: Start typing, then Ctrl + Cursor up
- **Search command history**: Ctrl + R, then type
- **Source with echo**: Ctrl + Shift + Enter
- **Move lines up/down**: Alt + Cursor up/down
- **Indent/outdent**: Tab / Shift + Tab
- **Find in all files**: Ctrl + Shift + F

---

# Tidy data

> “Tidy datasets are all alike, but every messy dataset is messy in its own way.” –– Hadley Wickham

## Definition

1. Each variable must have its own column.
2. Each observation must have its own row.
3. Each value must have its own cell.

![](https://r4ds.had.co.nz/images/tidy-1.png)

Source: Grolemund and Wickham, R for data science

???

Source: R4DS

---

# Aesthetics

- `x`, `y`
- `shape`
- `size`
- `alpha`
- `text`
- `color` vs. `fill`
- `group`

Placed inside `aes()`: Map to variable, show legend

Placed outside `aes()`: Change for all points, no legend

---

# Graphing template

```r
ggplot(
  data = .DATA.,
  aes(x = ..., y = ..., ...)
) +
  .GEOM.(
    data = .DATA.,
    aes(x = ..., y = ..., ...),
    .AES. = .CONST.,
    position = .POSITION.
  ) +
  .STAT.(...) +
  .FACET.(...)
```

---

# Data transformation

## One table

- `filter()`
- `select()` / `rename()`
- `arrange()`
- `mutate()` / `transmute()`
- `summarise()`

## Grouped operations

- `group_by()`

## Joins

- `xxx_join()`

---

# Filter criteria

- Operators: `==`, `!=`, `<`, `>`, `<=`, `>=`
    ```r
    month == 3              # careful: two =
    month >= 10
    carrier != "UA"         # careful: <> doesn't work
    arr_time < dep_time
    ```
- `near()`
    ```r
    near(sin(pi), 0)
    ```
- `between()`, `%in%`
    ```r
    between(hour, 8, 12)
    month %in% c(12, 1, 2)
    ```

---

# Filter criteria for strings

- Operators: `==`, `!=`, `<`, `>`, `<=`, `>=`
- Searching for pattern:
    ```r
    library(stringr)
    
    str_detect(tailnum, "^[^N]")
    str_detect(carrier, fixed("A"))
    ```

---

# Combining filter criteria

- Operators: `&`, `|`, `!`
    ```r
    (month == 5) & between(day, 17, 18)
    (month == 3) | (month == 4)
    !between(month, 3, 6)
    ```
- Missing values
    ```r
    is.na(arr_time)
    is.na(NA + 3)
    is.na(!NA)
    is.na(0)
    ```

---

# Selection helpers

## By name

- `. %>% select(var1, var2)`
- `. %>% select(var1, everything())`
- `. %>% select(ends_with("delay"))`

## Range

- `. %>% select(var1:var2)`
- `. %>% select(-var1:-var2)`

## By position

- `. %>% select(1:5)`

---

# Sorting data

- `NA` sorts last
- Use `desc()` to reverse sorting order

---

# Mutation functions

- Arithmetic: `+`, `-`, `*`, `/`, `^`, `%%`, `%/%`
    ```r
    dep_delay - arr_delay
    dep_time %/% 100
    dep_time %% 100
    dep_delay - mean(dep_delay)   # See next slide
    ```
- Real functions, see `?base::Math` and `?dplyr::lead`:
    - Rounding: `floor()`, `ceiling()`, `round()`
    - Sign: `abs()`, `sign()`
    - Transform: `sqrt()`, `log()`, `log2()`, `exp()`
    - Trigonometric: `sin()` etc.
    - Cumulative: `cumsum()` etc.
    - Lead and lag: `lead()`, `lag()`
    
- Recoding: `if_else()`, `case_when()`, `recode()`

- All filtering functions for a new `logical` column

- Ranking: `row_number()`, `min_rank()`, `ntile()`

---

# Mutation function for strings

- Replacing by pattern:
    ```r
    library(stringr)
    
    str_replace(origin, "GA", "XX")
    ```

---

# Aggregation functions

## Statistics

- `sum()`, `prod()`
    - `na.rm = TRUE`
- `mean()`, `median()`
- `sd()`, `IQR()`, `mad()`
- `min()`, `quantile(0.75)`, `max()`
- `sum()` and `mean()` for `logical` variables:
    ```r
    mean(is.na(arr_time))
    ```

## Ranking

- `n()`
- `first()`, `last()`, `nth()`

---

# Graphing template, with transformation

```r
.DATA. %>%
  ... %>%
  ggplot(
    aes(x = ..., y = ..., ...)
  ) +
  .GEOM.(
    aes(x = ..., y = ..., ...),
    .AES. = .CONST.,
    position = .POSITION.
  ) +
  .STAT.(...) +
  .FACET.(...) +
  .SCALE.(...) +
  .COORD.(...) +
  .THEME.(...)
```

---

---

---

---

---

---

---

---

---

---

---

---

# Joins

- For each combination of join variables in the left data frame, find corresponding rows in the right data frame
    - Default: Join by matching variable names
    
.pull-left[

## Mutating join

- Always returns rows from left *and* right data frame
- Difference: behavior on mismatch
    - `inner_join()`: Keep only matching rows
    - `left_join()`: Keep all rows from left
    - `right_join()`: Keep all rows from right
    - `full_join()`: Keep all rows

]

## Filtering join

- Only return rows from left data frame
- Difference: Returned set
    - `semi_join()`: Keep matching rows
    - `anti_join()`: Remove matching rows

]

---

---

# Spread and gather

## spread: long to wide

- Take new column names from key column
- Distribute values across new column names

]

## gather: wide to long

- Create new key column with column names
- Fill existing data into new column

]