
Data visualisation, reporting, and processing with R

Closing remarks

Kirill Müller, cynkra GmbH

2018-11-30

1 / 17

EDA

Understand your data

  • Generate questions

  • Search for answers

  • Rinse and repeat

2 / 17

Important questions

  • Variation

  • Covariation

3 / 17

Tabular data

  • Distribution
  • Typical values
  • Unusual values, outliers
  • Missing values
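
The checks listed above can be sketched in a few lines of dplyr. This is only an illustration on the mpg data used throughout these slides: the 1.5 * IQR outlier rule and the name hwy_outliers are assumptions made for the example, and across() requires dplyr 1.0.0 or later.

```r
library(dplyr)
library(ggplot2)  # provides the mpg data set

# Missing values: count the NAs in every column
mpg %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

# Typical values: the most frequent engine displacements come first
mpg %>%
  count(displ, sort = TRUE)

# Unusual values: rows more than 1.5 * IQR outside the quartiles of hwy
# (the threshold is a common convention, assumed here for illustration)
hwy_outliers <- mpg %>%
  filter(
    hwy < quantile(hwy, 0.25) - 1.5 * IQR(hwy) |
      hwy > quantile(hwy, 0.75) + 1.5 * IQR(hwy)
  )
hwy_outliers
```

Whether a flagged row is an error or an interesting observation is a question for the next round of exploration.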

Discrete variables

library(tidyverse)  # ggplot2 (with the mpg data), dplyr, tidyr

ggplot(data = mpg) +
  geom_bar(
    mapping = aes(x = class)
  )

mpg %>%
  count(class)
## # A tibble: 7 x 2
##   class          n
##   <chr>      <int>
## 1 2seater        5
## 2 compact       47
## 3 midsize       41
## 4 minivan       11
## 5 pickup        33
## 6 subcompact    35
## 7 suv           62
4 / 17

Continuous variables

ggplot(data = mpg) +
  geom_histogram(
    mapping = aes(x = displ),
    binwidth = 0.05
  )

ggplot(data = mpg) +
  geom_density(
    mapping = aes(x = displ)
  )

5 / 17

Categorical vs. continuous variables

ggplot(data = mpg) +
  geom_density(
    mapping = aes(x = displ, y = ..scaled.., color = class)
  )

6 / 17

Categorical vs. continuous variables

ggplot(data = mpg) +
  geom_boxplot(
    mapping = aes(x = class, y = displ)
  )

7 / 17

Two categorical variables

ggplot(data = mpg) +
  geom_bin2d(
    mapping = aes(x = drv, y = class)
  )

mpg %>%
  count(drv, class)
## # A tibble: 12 x 3
##    drv   class          n
##    <chr> <chr>      <int>
##  1 4     compact       12
##  2 4     midsize        3
##  3 4     pickup        33
##  4 4     subcompact     4
##  5 4     suv           51
##  6 f     compact       35
##  7 f     midsize       38
##  8 f     minivan       11
##  9 f     subcompact    22
## 10 r     2seater        5
## 11 r     subcompact     9
## 12 r     suv           11
8 / 17

Two continuous variables

ggplot(data = mpg) +
  geom_jitter(
    mapping = aes(x = hwy, y = cty),
    alpha = 0.3
  )

9 / 17
10 / 17

More transformations

  • Grouped mutate() and filter(), see r4ds 5.7.1
  • Scoped functions, see ?scoped
  • complete() and fill()
  • extract()
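
A minimal sketch of the operations listed above. The tibbles sales and codes are hypothetical toy data invented for this example, and all result names are illustrative. Newer releases of dplyr and tidyr offer across() and separate_wider_regex() as alternatives to the scoped functions and extract(), but the calls below still work.

```r
library(dplyr)
library(tidyr)
library(ggplot2)  # provides the mpg data set

# Grouped mutate(): deviation of each car from the mean of its class
mpg_centered <- mpg %>%
  group_by(class) %>%
  mutate(hwy_diff = hwy - mean(hwy)) %>%
  ungroup()

# Grouped filter(): keep the most efficient car(s) within each class
best_in_class <- mpg %>%
  group_by(class) %>%
  filter(hwy == max(hwy)) %>%
  ungroup()

# Scoped function: average every numeric column per class
class_means <- mpg %>%
  group_by(class) %>%
  summarise_if(is.numeric, mean)

# Hypothetical toy data with a gap in year and missing combinations
sales <- tibble(
  year    = c(2017, NA, 2018),
  quarter = c("Q1", "Q3", "Q2"),
  amount  = c(10, 20, 30)
)

# fill(): carry the last observed year downwards
filled <- sales %>% fill(year)

# complete(): add a row for every year/quarter combination (2 x 3 = 6 rows)
completed <- filled %>% complete(year, quarter)

# extract(): split a column into several using regular expression groups
codes <- tibble(id = c("a-1", "b-2"))
parts <- codes %>%
  tidyr::extract(id, into = c("letter", "number"), regex = "([a-z])-([0-9])")
```

The tidyr:: prefix on extract() avoids a clash with magrittr::extract() when magrittr is attached.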

Same syntax for working with databases

  • dbplyr package
  • Transformation operations are translated to SQL
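
To see the translation without setting up a database, dbplyr can simulate a backend. The sketch below assumes a simulated SQLite connection and reuses the mpg data; cars_db and query are illustrative names. With a real database, the lazy table would come from tbl() on a DBI connection instead.

```r
library(dplyr)
library(dbplyr)

# A lazy table on a simulated SQLite backend: no database is needed
cars_db <- tbl_lazy(ggplot2::mpg, con = simulate_sqlite())

# The usual dplyr verbs build a query instead of computing a result
query <- cars_db %>%
  group_by(class) %>%
  summarise(mean_hwy = mean(hwy, na.rm = TRUE))

# Print the SQL that would be sent to the database
show_query(query)
```

On a real connection, collect() runs the query and returns the result as a tibble.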
11 / 17

Joins

r4ds, chapter 13

library(nycflights13)  # provides flights and airlines

flights %>%
  select(year, month, day, carrier) %>%
  left_join(airlines)
## Joining, by = "carrier"
## # A tibble: 336,776 x 5
##     year month   day carrier name
##    <int> <int> <int> <chr>   <chr>
##  1  2013     1     1 UA      United Air Lines Inc.
##  2  2013     1     1 UA      United Air Lines Inc.
##  3  2013     1     1 AA      American Airlines Inc.
##  4  2013     1     1 B6      JetBlue Airways
##  5  2013     1     1 DL      Delta Air Lines Inc.
##  6  2013     1     1 UA      United Air Lines Inc.
##  7  2013     1     1 B6      JetBlue Airways
##  8  2013     1     1 EV      ExpressJet Airlines Inc.
##  9  2013     1     1 B6      JetBlue Airways
## 10  2013     1     1 AA      American Airlines Inc.
## # ... with 336,766 more rows
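
In scripts it is often safer to name the join key explicitly and to check for keys without a partner. A minimal sketch of both, where flights_named and unmatched are illustrative names:

```r
library(dplyr)
library(nycflights13)  # provides flights and airlines

# Naming the key avoids the "Joining, by = ..." message and protects
# against accidental matches on other columns with identical names
flights_named <- flights %>%
  select(year, month, day, carrier) %>%
  left_join(airlines, by = "carrier")

# anti_join() keeps the rows of flights whose carrier has no match in airlines
unmatched <- flights %>%
  anti_join(airlines, by = "carrier") %>%
  count(carrier)
unmatched
```

A left join on a unique key never changes the number of rows, so comparing nrow() before and after is a cheap sanity check.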
12 / 17

Nested data frames

r4ds, chapter 25

# tidyr >= 1.0.0 spells this nest(data = -month)
flights %>%
  nest(-month) %>%
  arrange(month)
## # A tibble: 12 x 2
##    month data
##    <int> <list>
##  1     1 <tibble [27,004 × 18]>
##  2     2 <tibble [24,951 × 18]>
##  3     3 <tibble [28,834 × 18]>
##  4     4 <tibble [28,330 × 18]>
##  5     5 <tibble [28,796 × 18]>
##  6     6 <tibble [28,243 × 18]>
##  7     7 <tibble [29,425 × 18]>
##  8     8 <tibble [29,327 × 18]>
##  9     9 <tibble [27,574 × 18]>
## 10    10 <tibble [28,889 × 18]>
## 11    11 <tibble [27,268 × 18]>
## 12    12 <tibble [28,135 × 18]>
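
A nested data frame becomes useful once a function is applied to each element of the list column. The sketch below uses purrr for that step; by_month, n_flights and mean_dep_delay are illustrative names, and group_by() followed by nest() is used here as one way to build the same nested structure.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(nycflights13)  # provides flights

by_month <- flights %>%
  group_by(month) %>%
  nest() %>%          # one row per month, remaining columns in `data`
  ungroup() %>%
  arrange(month) %>%
  mutate(
    # map_*() applies a function to every nested tibble
    n_flights      = map_int(data, nrow),
    mean_dep_delay = map_dbl(data, ~ mean(.x$dep_delay, na.rm = TRUE))
  )
by_month
```

The same pattern extends to fitting one model per group, as described in the r4ds chapter referenced above.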
13 / 17

Visualizations not covered

Position adjustments

jitter, dodge, stack, nudge

Scales

labeling, color, range

Coordinate systems

flipping, aspect ratio, polar

Theming

tweaking plots, standard appearance
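
As a rough illustration, one plot can touch all four topics. The palette, the axis and legend titles, and the theme below are arbitrary choices made for this sketch:

```r
library(ggplot2)

p <- ggplot(data = mpg) +
  geom_bar(
    mapping = aes(x = class, fill = drv),
    position = "dodge"                                         # position adjustment
  ) +
  scale_fill_brewer(name = "Drive train", palette = "Set2") +  # color scale
  scale_y_continuous(name = "Number of models") +              # axis labeling
  coord_flip() +                                               # coordinate system
  theme_minimal()                                              # theming
p
```

Each layer can be removed or swapped independently, which makes it easy to see what a single component contributes.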

14 / 17

Extensions for ggplot2

15 / 17

Pointers

Working directory hell:

Symbolic link to data directory:

  • Linux and OS X: file.symlink()
  • Windows: Sys.junction()
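
A small wrapper can pick the right function for the current platform. This is only a sketch: link_data_dir() is a hypothetical helper, and the paths in the usage comment are placeholders.

```r
# Hypothetical helper: make the directory `from` available under the name `to`
link_data_dir <- function(from, to) {
  if (.Platform$OS.type == "windows") {
    # Directory junction, available on Windows only
    Sys.junction(from, to)
  } else {
    # Symbolic link on Linux and OS X
    file.symlink(from, to)
  }
}

# Usage with placeholder paths:
# link_data_dir("/path/to/shared/data", "data")
```

Scripts can then refer to the relative path "data" regardless of where the actual files are stored.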

R markdown: http://rmarkdown.rstudio.com/gallery.html

16 / 17

Pointers 2

Literature:

17 / 17
