Data visualisation, reporting, and processing with R

Supporting slides

Kirill Müller, cynkra GmbH

Survey results

Purpose of R: Automation, interactive work, toolsets

Some previous experience with R

Excel, SPSS/Stata/SAS/Access/databases

Some programming experience, e.g. C/C++, VB/VBA, C, FORTRAN, Java

Little experience with shell/command line and VCS

Windows and macOS


Source: Grolemund and Wickham, R for data science

Other important RStudio shortcuts

  • Focus source/console: Ctrl + 1 / Ctrl + 2
  • Filter command history: Start typing, then Ctrl + Cursor up
  • Search command history: Ctrl + R, then type
  • Source with echo: Ctrl + Shift + Enter
  • Move lines up/down: Alt + Cursor up/down
  • Indent/outdent: Tab / Shift + Tab
  • Find in all files: Ctrl + Shift + F

Tidy data

“Tidy datasets are all alike, but every messy dataset is messy in its own way.” (Hadley Wickham)

Definition

  1. Each variable must have its own column.
  2. Each observation must have its own row.
  3. Each value must have its own cell.

Source: Grolemund and Wickham, R for data science

Source: R4DS

Aesthetics

  • x, y
  • shape
  • size
  • alpha
  • text
  • color vs. fill
  • group

Placed inside aes(): Map to variable, show legend

Placed outside aes(): Change for all points, no legend
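
The difference can be sketched as follows. This is a minimal example that assumes the ggplot2 package and its bundled `mpg` dataset, neither of which is named on this slide.

```r
library(ggplot2)

# Inside aes(): colour is mapped to the variable `class`, so a legend appears.
p_mapped <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class))

# Outside aes(): colour is one constant for all points, so no legend is drawn.
p_constant <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "blue")
```

Printing `p_mapped` and `p_constant` side by side shows the same points, once with a legend and once without.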

Graphing template

ggplot(
  data = .DATA.,
  aes(x = ..., y = ..., ...)
) +
  .GEOM.(
    data = .DATA.,
    aes(x = ..., y = ..., ...),
    .AES. = .CONST.,
    position = .POSITION.
  ) +
  .STAT.(...) +
  .FACET.(...)
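
One possible way to fill in the template, assuming ggplot2's bundled `mpg` dataset. The choice of geom, stat, and facet is illustrative only.

```r
library(ggplot2)

p <- ggplot(
  data = mpg,                      # .DATA.
  aes(x = displ, y = hwy)
) +
  geom_point(                      # .GEOM.
    aes(color = drv),              # mapped aesthetic
    alpha = 0.5,                   # .AES. = .CONST.
    position = "jitter"            # .POSITION.
  ) +
  stat_smooth(method = "lm") +     # .STAT.
  facet_wrap(~ class)              # .FACET.
```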

Data transformation

One table

  • filter()
  • select() / rename()
  • arrange()
  • mutate() / transmute()
  • summarise()

Grouped operations

  • group_by()

Joins

  • xxx_join()
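
A minimal sketch of how the one-table verbs chain together. `flights_toy` is a hypothetical toy table with flights-like column names, not the real dataset.

```r
library(dplyr)

# Hypothetical toy data
flights_toy <- tibble(
  carrier   = c("UA", "UA", "AA", "AA", "DL"),
  month     = c(1, 3, 3, 5, 12),
  dep_delay = c(10, -2, 30, 0, 5),
  arr_delay = c(12, -5, 25, 3, 5)
)

result <- flights_toy %>%
  filter(month >= 3) %>%                      # keep rows
  select(carrier, dep_delay, arr_delay) %>%   # keep columns
  mutate(gain = dep_delay - arr_delay) %>%    # add a column
  group_by(carrier) %>%                       # define groups
  summarise(mean_gain = mean(gain)) %>%       # one row per group
  arrange(desc(mean_gain))                    # sort, largest first
```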

Filter criteria

  • Operators: ==, !=, <, >, <=, >=
    month == 3 # careful: two =
    month >= 10
    carrier != "UA" # careful: <> doesn't work
    arr_time < dep_time
  • near()
    near(sin(pi), 0)
  • between(), %in%
    between(hour, 8, 12)
    month %in% c(12, 1, 2)
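
The criteria above can be tried on a small table. `toy` is hypothetical data invented for illustration.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  month = c(1, 3, 10, 12),
  hour  = c(7, 8, 12, 13)
)

march   <- filter(toy, month == 3)              # exact match
morning <- filter(toy, between(hour, 8, 12))    # both bounds are inclusive
winter  <- filter(toy, month %in% c(12, 1, 2))  # membership in a set

# Floating-point results are rarely exactly equal; near() allows a tolerance.
exact <- sin(pi) == 0        # FALSE
close <- near(sin(pi), 0)    # TRUE
```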

Filter criteria for strings

  • Operators: ==, !=, <, >, <=, >=
  • Searching for pattern:

    library(stringr)
    str_detect(tailnum, "^[^N]")
    str_detect(carrier, fixed("A"))
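
A short sketch of the two pattern styles, using made-up vectors in place of the data frame columns:

```r
library(stringr)

# Hypothetical values
tailnum <- c("N14228", "D942DN", "N24211")
carrier <- c("UA", "DL", "AA")

# Regular expression: strings that do NOT start with "N"
not_n <- str_detect(tailnum, "^[^N]")

# fixed(): match the text literally, without regex interpretation
has_a <- str_detect(carrier, fixed("A"))

# The distinction matters for special characters such as "."
dot_regex <- str_detect("ab", ".")          # TRUE: "." matches any character
dot_fixed <- str_detect("ab", fixed("."))   # FALSE: no literal dot present
```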

Combining filter criteria

  • Operators: &, |, !
    (month == 5) & between(day, 17, 18)
    (month == 3) | (month == 4)
    !between(month, 3, 6)
  • Missing values
    is.na(arr_time)
    is.na(NA + 3)
    is.na(!NA)
    is.na(0)
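
A sketch of how these pieces behave inside `filter()`, on hypothetical toy data. Note that `filter()` keeps only rows where the condition is `TRUE`, so rows where it evaluates to `NA` are dropped.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  month    = c(3, 4, 5, 5),
  day      = c(1, 17, 17, 18),
  arr_time = c(900, NA, 1200, NA)
)

may_days  <- filter(toy, (month == 5) & between(day, 17, 18))
mar_apr   <- filter(toy, (month == 3) | (month == 4))
not_early <- filter(toy, !between(month, 3, 4))

# Comparisons with NA give NA, and those rows are silently dropped ...
late <- filter(toy, arr_time > 1000)
# ... unless missing values are requested explicitly.
late_or_na <- filter(toy, is.na(arr_time) | arr_time > 1000)

# NA propagates through arithmetic and negation
na_checks <- c(is.na(NA + 3), is.na(!NA), is.na(0))   # TRUE TRUE FALSE
```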

Selection helpers

By name

  • . %>% select(var1, var2)
  • . %>% select(var1, everything())
  • . %>% select(ends_with("delay"))

Range

  • . %>% select(var1:var2)
  • . %>% select(-var1:-var2)

By position

  • . %>% select(1:5)
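
The helpers can be compared on a one-row table. `toy` is hypothetical, and the negative range is written here in the parenthesised form `-(a:b)`.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  year = 2013, month = 1, day = 1,
  dep_delay = 2, arr_delay = 11, carrier = "UA"
)

first_col <- names(select(toy, carrier, everything()))   # carrier moved to front
delays    <- names(select(toy, ends_with("delay")))
in_range  <- names(select(toy, month:dep_delay))
out_range <- names(select(toy, -(month:dep_delay)))
by_pos    <- names(select(toy, 1:3))
```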

Sorting data

  • NA always sorts last, even with desc()
  • Use desc() to reverse sorting order
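
A minimal check of both points with `arrange()`, on a hypothetical one-column table:

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(x = c(2, NA, 3, 1))

ascending  <- arrange(toy, x)$x         # 1 2 3 NA
descending <- arrange(toy, desc(x))$x   # 3 2 1 NA
```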

Mutation functions

  • Arithmetic: +, -, *, /, ^, %%, %/%
    dep_delay - arr_delay
    dep_time %/% 100
    dep_time %% 100
    dep_delay - mean(dep_delay) # See next slide
  • Real functions, see ?base::Math and ?dplyr::lead:

    • Rounding: floor(), ceiling(), round()
    • Sign: abs(), sign()
    • Transform: sqrt(), log(), log2(), exp()
    • Trigonometric: sin() etc.
    • Cumulative: cumsum() etc.
    • Lead and lag: lead(), lag()
  • Recoding: if_else(), case_when(), recode()

  • All filtering functions for a new logical column

  • Ranking: row_number(), min_rank(), ntile()
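A sketch that applies several of these function families in one `mutate()` call. `toy` is hypothetical data with flights-like column names.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  dep_time  = c(517, 1342, 2359),
  dep_delay = c(2, -3, 40)
)

out <- toy %>%
  mutate(
    hour        = dep_time %/% 100,                  # integer division
    minute      = dep_time %% 100,                   # remainder
    prev_delay  = lag(dep_delay),                    # value of the previous row
    total_delay = cumsum(dep_delay),                 # running total
    status      = if_else(dep_delay > 0, "late", "on time"),
    delay_rank  = min_rank(desc(dep_delay))          # 1 = largest delay
  )
```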

Mutation functions for strings

  • Replacing by pattern:

    library(stringr)
    str_replace(origin, "GA", "XX")
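
A quick sketch on a made-up vector. `str_replace()` changes only the first match in each string; stringr's `str_replace_all()` changes every match.

```r
library(stringr)

# Hypothetical values
origin <- c("LGA", "JFK", "GAGA")

first_only <- str_replace(origin, "GA", "XX")       # "LXX" "JFK" "XXGA"
all_hits   <- str_replace_all(origin, "GA", "XX")   # "LXX" "JFK" "XXXX"
```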

Aggregation functions

Statistics

  • sum(), prod()
    • na.rm = TRUE
  • mean(), median()
  • sd(), IQR(), mad()
  • min(), quantile(x, 0.75), max()
  • sum() and mean() for logical variables:
    mean(is.na(arr_time))

Ranking

  • n()
  • first(), last(), nth()
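
A sketch of several aggregation functions inside `summarise()`, on hypothetical toy data:

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  carrier   = c("UA", "UA", "AA", "AA"),
  arr_time  = c(900, NA, 1200, 1300),
  arr_delay = c(10, NA, -5, 15)
)

summary_tbl <- toy %>%
  group_by(carrier) %>%
  summarise(
    n             = n(),                                  # rows per group
    mean_delay    = mean(arr_delay, na.rm = TRUE),        # ignore missing values
    q75           = quantile(arr_delay, 0.75, na.rm = TRUE),
    share_missing = mean(is.na(arr_time)),                # mean of a logical = proportion
    first_delay   = first(arr_delay)
  )
```

Without `na.rm = TRUE`, `mean_delay` would be `NA` for any group that contains a missing value.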

Graphing template, with transformation

.DATA. %>%
  ... %>%
  ggplot(
    aes(x = ..., y = ..., ...)
  ) +
  .GEOM.(
    aes(x = ..., y = ..., ...),
    .AES. = .CONST.,
    position = .POSITION.
  ) +
  .STAT.(...) +
  .FACET.(...) +
  .SCALE.(...) +
  .COORD.(...) +
  .THEME.(...)
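
One possible instance of this template, assuming ggplot2's bundled `mpg` dataset. The summary is stored in `by_class` first only so that the intermediate table can be inspected.

```r
library(dplyr)
library(ggplot2)

# Transformation step: one row per vehicle class
by_class <- mpg %>%
  group_by(class) %>%
  summarise(mean_hwy = mean(hwy))

# Plotting step: the transformed data is piped into ggplot()
p <- by_class %>%
  ggplot(aes(x = class, y = mean_hwy)) +
  geom_col(fill = "steelblue") +                       # .GEOM. with constant fill
  scale_y_continuous(name = "Mean highway mpg") +      # .SCALE.
  coord_flip() +                                       # .COORD.
  theme_minimal()                                      # .THEME.
```

Note the switch from `%>%` to `+` once `ggplot()` has been called.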

Joins

  • For each combination of join variables in the left data frame, find corresponding rows in the right data frame
    • Default: Join by matching variable names

Mutating join

  • Always returns columns from both the left and the right data frame
  • Difference: behavior on mismatch
    • inner_join(): Keep only matching rows
    • left_join(): Keep all rows from left
    • right_join(): Keep all rows from right
    • full_join(): Keep all rows

Filtering join

  • Only return rows from left data frame
  • Difference: Returned set
    • semi_join(): Keep matching rows
    • anti_join(): Remove matching rows
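
The six joins can be compared on two small tables. `flights_toy` and `airlines_toy` are hypothetical; "ZZ" has no match on the right and "DL" has no match on the left.

```r
library(dplyr)

# Hypothetical toy data
flights_toy  <- tibble(carrier = c("UA", "AA", "ZZ"), flight = c(1, 2, 3))
airlines_toy <- tibble(carrier = c("UA", "AA", "DL"),
                       name = c("United", "American", "Delta"))

# Mutating joins: the result has the columns of both tables
inner <- inner_join(flights_toy, airlines_toy, by = "carrier")  # UA, AA
left  <- left_join(flights_toy, airlines_toy, by = "carrier")   # + ZZ, name is NA
right <- right_join(flights_toy, airlines_toy, by = "carrier")  # + DL, flight is NA
full  <- full_join(flights_toy, airlines_toy, by = "carrier")   # all four carriers

# Filtering joins: the result has only the columns of the left table
semi <- semi_join(flights_toy, airlines_toy, by = "carrier")    # UA, AA
anti <- anti_join(flights_toy, airlines_toy, by = "carrier")    # ZZ
```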

Spread and gather

spread: long to wide

  • Take new column names from key column
  • Distribute values across new column names

gather: wide to long

  • Create new key column with column names
  • Fill existing data into new column
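
A round trip with tidyr's `spread()` and `gather()`, sketched on a hypothetical table. The column names `country`, `year`, and `cases` are invented for illustration.

```r
library(tidyr)

# Hypothetical long-format data: one row per country and year
long <- data.frame(
  country = c("A", "A", "B", "B"),
  year    = c(1999, 2000, 1999, 2000),
  cases   = c(1, 2, 3, 4)
)

# Long to wide: values of `year` become column names, `cases` fills the cells
wide <- spread(long, key = year, value = cases)

# Wide to long: the listed columns are stacked back into key/value pairs
back <- gather(wide, key = "year", value = "cases", `1999`, `2000`)
```

`gather()` returns the former column names as character strings, so `year` comes back as `"1999"` rather than `1999` unless `convert = TRUE` is set.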