+ - 0:00:00
Notes for current slide
Notes for next slide

DBI

Recent developments in R’s database interface

Kirill Müller, @krlmlr

2018-09-03, R meetup Zurich, https://bit.ly/2wFJix3

1 / 63

(Commercial)

2 / 63

adminR: R in Swiss Official Statistics

R users in Swiss public adminis-tration, or institutions with ties to it

  • Get to know each other
  • Share our experience
  • Learn about different use cases of R in public statistics

Thursday, September 20, 2018
5:15 PM to 7:15 PM

State Secretariat for Economic Affairs

SECO · Holzikofenweg 36 · Bern

 

(Commercial)

3 / 63

Who has ...

via GIPHY

4 / 63

Hi! Welcome everybode to my presentation.

Today I'll talk about R and databases.

Before we start, I'd like to conduct a short survey.

... worked with R?

5 / 63

... used a data frame (or tibble) in R?

6 / 63

... worked with a database?

7 / 63

A separate server whose only purpose is to store and retrieve data.

... written a SQL query?

8 / 63

... or "an Ess-Cue-Ell query"?

... connected to a database from R?

9 / 63

to retrieve some data

... connected to a database from R

using DBI ?

10 / 63

Did your code use dbConnect() ?

... defined a database schema?

11 / 63

If you don't know what a schema is, you probably haven't. But I'll talk about it later.

... implemented a DBI backend?

12 / 63

If you don't know what a DBI backend is, you probably haven't.

 ↔ 
13 / 63

Thanks. Today's presentation will be about connecting relational databases to R using the DBI package. I'd like to make this presentation useful for you, no matter if you're new to databases or if you use them daily. I'm going to start with some basic concepts, but I'm also including more advanced examples.

  • storage for a set of rectangular tables
    • rows and columns, atomic columns
    • similar to data frames
  • "relational": some tables have column with the same meaning (or name), rows from different tables with the same value in these columns are related
  • databases store their data on disk, optimized for retrieval without reading everything into memory
    • Postgres, MySQL/MariaDB, SQLite (open-source)
    • Microsoft SQL Server, Oracle (commercial)
  • No non-relational: Mongo DB or ElasticSearch, see nodbi by Scott
  • No other solutions such as the RODBC package

Connect to a Postgres database

library(RPostgres)
con <- dbConnect(Postgres())
dbGetQuery(con, "SELECT 'Hello, world!' AS hi")
## hi
## 1 Hello, world!
dbDisconnect(con)
14 / 63

This talk will use a lot of code examples, please bear with me.

  • Explain code line by line
  • Query string: SQL

Connect to a MariaDB database

library(RMariaDB)
con <- dbConnect(MariaDB())
dbGetQuery(con, "SELECT 'Hello, world!' AS hi")
## hi
## 1 Hello, world!
dbDisconnect(con)
15 / 63

Mention SQL dialects

Connect to a SQLite database

library(RSQLite)
con <- dbConnect(SQLite())
dbGetQuery(con, "SELECT 'Hello, world!' AS hi")
## hi
## 1 Hello, world!
dbDisconnect(con)
16 / 63

DataBase Interface

library(DBI)
con <- dbConnect(RSQLite::SQLite())
# con <- dbConnect(RMariaDB::MariaDB())
# con <- dbConnect(RPostgres::Postgres())
dbGetQuery(con, "SELECT 'Hello, world!' AS hi")
## hi
## 1 Hello, world!
dbDisconnect(con)
17 / 63

Not only syntactic similarity

Attach DBI instead of individual database packages, only specify database in dbConnect()

Use only functions from DBI to work with the database, only refer to database backends when connecting.

  • S4 classes and methods
  • method = function defined in DBI
18 / 63

What is a DBI backend?

More packages involved, omitted here

ODBC is covered as part of this framework

Extra packages

  • DBI: Function definition
  • Can use database-specific backends
19 / 63

DBI has been around for ages, almost as old as R itself.

Interfaces: syntax vs. semantics

20 / 63

DBI 0.x had the "syntax" bit right

Subtle differences in semantics between packages

  • Data types
    • 64-bit integers
  • Character encoding
  • Time zones
  • Behavior in corner cases (zero rows)
  • Optional syntax
21 / 63

DBI specification

Implementers

Test suite

Clear expectations

Boilerplate

Users

Interchangeable backends

Reliable behavior

Precise documentation

 

 

22 / 63
  • Simplify user's life

    • Only learn one interface
  • Simplify implementer's life

  • We can't force DBI backends, but we provide a test suite

  • Project funded by R Consortium

  • First two parts done
23 / 63
  • Walk through DBI's features
    • Pointing to new features
  • Showcase interesting developments

R's DataBase Interface

connect/disconnect

discover

read/write

query/execute

quote/parametrize

transact

24 / 63

Connecting

via GIPHY

25 / 63

First step when working with a database

Can be frustrating when things don't work, but needs to be done only once

Only thing that is backend-specific: connection parameters

library(DBI)
con_sqlite <- dbConnect(
RSQLite::SQLite(),
path = "..."
)
con_mariadb <- dbConnect(
RMariaDB::MariaDB(),
default.file = "..."
)
con_postgres <- dbConnect(
RPostgres::Postgres(),
host = "...",
port = "..."
)
dbDisconnect(con_postgres)
dbDisconnect(con_mariadb)
dbDisconnect(con_sqlite)
26 / 63
27 / 63

Discovering

via GIPHY

28 / 63
library(DBI)
con <- dbConnect(RPostgres::Postgres())
dbListTables(con)
## [1] "geography_columns" "geometry_columns" "spatial_ref_sys"
## [4] "raster_columns" "raster_overviews" "topology"
## [7] "layer" "geom_test" "temp"
29 / 63
dbListObjects(con)
## table is_prefix
## 1 <Id> table = geography_columns FALSE
## 2 <Id> table = geometry_columns FALSE
## 3 <Id> table = spatial_ref_sys FALSE
## 4 <Id> table = raster_columns FALSE
## 5 <Id> table = raster_overviews FALSE
## 6 <Id> table = topology FALSE
## 7 <Id> table = layer FALSE
## 8 <Id> table = geom_test FALSE
## 9 <Id> table = temp FALSE
## 10 <Id> schema = topology TRUE
## 11 <Id> schema = my_schema TRUE
## 12 <Id> schema = information_schema TRUE
## 13 <Id> schema = pg_catalog TRUE
## 14 <Id> schema = nycflights13 TRUE
## 15 <Id> schema = public TRUE
30 / 63

Schema: like a directory in the file system

flights_obj <- dbListObjects(con, Id(schema = "nycflights13"))
flights_obj
## table is_prefix
## 1 <Id> schema = nycflights13, table = airlines FALSE
## 2 <Id> schema = nycflights13, table = airports FALSE
## 3 <Id> schema = nycflights13, table = flights FALSE
## 4 <Id> schema = nycflights13, table = planes FALSE
## 5 <Id> schema = nycflights13, table = weather FALSE
flights_obj$table[[3]]
## <Id> schema = nycflights13, table = flights
31 / 63
32 / 63

Reading

via GIPHY

33 / 63
spatial_ref_sys <- dbReadTable(con, "spatial_ref_sys")
class(spatial_ref_sys)
## [1] "data.frame"
tibble::as_tibble(spatial_ref_sys)
## # A tibble: 5,435 x 5
## srid auth_name auth_srid srtext proj4text
## <int> <chr> <int> <chr> <chr>
## 1 3819 EPSG 3819 "GEOGCS[\"HD1909… "+proj=longlat +e…
## 2 3821 EPSG 3821 "GEOGCS[\"TWD67\… "+proj=longlat +e…
## 3 3824 EPSG 3824 "GEOGCS[\"TWD97\… "+proj=longlat +e…
## 4 3889 EPSG 3889 "GEOGCS[\"IGRS\"… "+proj=longlat +e…
## 5 3906 EPSG 3906 "GEOGCS[\"MGI 19… "+proj=longlat +e…
## 6 4001 EPSG 4001 "GEOGCS[\"Unknow… "+proj=longlat +e…
## 7 4002 EPSG 4002 "GEOGCS[\"Unknow… "+proj=longlat +e…
## 8 4003 EPSG 4003 "GEOGCS[\"Unknow… "+proj=longlat +e…
## 9 4004 EPSG 4004 "GEOGCS[\"Unknow… "+proj=longlat +e…
## 10 4005 EPSG 4005 "GEOGCS[\"Unknow… "+proj=longlat +a…
## # ... with 5,425 more rows
34 / 63

This presentation is a "pure" presentation

flights <- dbReadTable(
con,
Id(schema = "nycflights13", table = "flights")
)
tibble::as_tibble(flights)
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## 7 2013 1 1 555 600 -5 913
## 8 2013 1 1 557 600 -3 709
## 9 2013 1 1 557 600 -3 838
## 10 2013 1 1 558 600 -2 753
## # ... with 336,766 more rows, and 12 more variables:
## # sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <dttm>
35 / 63

Querying

via GIPHY

36 / 63
query <- 'SELECT * FROM "nycflights13"."flights" WHERE month = 1'
flights <- dbGetQuery(con, query)
dim(flights)
## [1] 27004 19
37 / 63

Quoting

via GIPHY

38 / 63

Quoting identifiers and literals

Dynamic SQL

Evil code!

month <- 1
query <- paste0(
'SELECT * FROM "nycflights13"."flights" WHERE month = ',
month
)
cat(query)
## SELECT * FROM "nycflights13"."flights" WHERE month = 1
39 / 63
month <- 1
query <- paste0(
'SELECT * FROM "nycflights13"."flights" WHERE month = ',
dbQuoteLiteral(con, month)
)
cat(query)
## SELECT * FROM "nycflights13"."flights" WHERE month = 1::float8
40 / 63
month_var <- "month"
month <- 1
query <- paste0(
'SELECT * FROM "nycflights13"."flights" WHERE ',
month_var,
' = ',
dbQuoteLiteral(con, month)
)
cat(query)
## SELECT * FROM "nycflights13"."flights" WHERE month = 1::float8
41 / 63
month_var <- "month"
month <- 1
query <- paste0(
'SELECT * FROM "nycflights13"."flights" WHERE ',
dbQuoteIdentifier(con, month_var),
' = ',
dbQuoteLiteral(con, month)
)
cat(query)
## SELECT * FROM "nycflights13"."flights" WHERE "month" = 1::float8
42 / 63
table_name <- Id(schema = "nycflights13", table = "flights")
month_var <- "month"
month <- 1
query <- paste0(
'SELECT * FROM ',
dbQuoteIdentifier(con, table_name),
' WHERE ',
dbQuoteIdentifier(con, month_var),
' = ',
dbQuoteLiteral(con, month)
)
cat(query)
## SELECT * FROM "nycflights13"."flights" WHERE "month" = 1::float8
flights <- dbGetQuery(con, query)
dim(flights)
## [1] 27004 19
43 / 63

Much more interesting when quoting strings, datetime, ...

table_name <- Id(schema = "nycflights13", table = "flights")
month_var <- "month"
month <- 1
query <- paste0(
'SELECT * FROM ',
dbQuoteIdentifier(con, table_name),
' WHERE ',
dbQuoteIdentifier(con, month_var),
' = $1'
)
cat(query)
## SELECT * FROM "nycflights13"."flights" WHERE "month" = $1
flights <- dbGetQuery(
con,
query,
params = list(1)
)
dim(flights)
## [1] 27004 19
44 / 63

via GIPHY

45 / 63
library(dplyr)
library(dbplyr)
src <- src_dbi(con)
flights <- tbl(con, ident_q('"nycflights13"."flights"'))
flights
## # Source: table<"nycflights13"."flights"> [?? x 19]
## # Database: postgres [kirill@/var/run/postgresql:5432/kirill]
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## 7 2013 1 1 555 600 -5 913
## 8 2013 1 1 557 600 -3 709
## 9 2013 1 1 557 600 -3 838
## 10 2013 1 1 558 600 -2 753
## # ... with more rows, and 12 more variables:
## # sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <dttm>
46 / 63
flights %>%
select(year, month, day, ends_with("delay")) %>%
filter(month == 2)
## # Source: lazy query [?? x 5]
## # Database: postgres [kirill@/var/run/postgresql:5432/kirill]
## year month day dep_delay arr_delay
## <int> <int> <int> <dbl> <dbl>
## 1 2013 2 1 -4 4
## 2 2013 2 1 -5 -4
## 3 2013 2 1 -3 8
## 4 2013 2 1 -8 -10
## 5 2013 2 1 0 9
## 6 2013 2 1 -8 -1
## 7 2013 2 1 -8 9
## 8 2013 2 1 -8 -14
## 9 2013 2 1 -7 18
## 10 2013 2 1 -7 -4
## # … with more rows
47 / 63
flights %>%
group_by(year, month, day) %>%
summarize(mean(arr_delay))
## Warning: Missing values are always removed in SQL.
## Use `AVG(x, na.rm = TRUE)` to silence this warning
## # Source: lazy query [?? x 4]
## # Database: postgres [kirill@/var/run/postgresql:5432/kirill]
## # Groups: year, month
## year month day `mean(arr_delay)`
## <int> <int> <int> <dbl>
## 1 2013 1 1 12.7
## 2 2013 1 2 12.7
## 3 2013 1 3 5.73
## 4 2013 1 4 -1.93
## 5 2013 1 5 -1.53
## 6 2013 1 6 4.24
## 7 2013 1 7 -4.95
## 8 2013 1 8 -3.23
## 9 2013 1 9 -0.264
## 10 2013 1 10 -5.90
## # … with more rows
48 / 63
flights %>%
group_by(year, month, day) %>%
summarize(mean_delay = mean(arr_delay, na.rm = TRUE))
## # Source: lazy query [?? x 4]
## # Database: postgres [kirill@/var/run/postgresql:5432/kirill]
## # Groups: year, month
## year month day mean_delay
## <int> <int> <int> <dbl>
## 1 2013 1 1 12.7
## 2 2013 1 2 12.7
## 3 2013 1 3 5.73
## 4 2013 1 4 -1.93
## 5 2013 1 5 -1.53
## 6 2013 1 6 4.24
## 7 2013 1 7 -4.95
## 8 2013 1 8 -3.23
## 9 2013 1 9 -0.264
## 10 2013 1 10 -5.90
## # … with more rows
49 / 63
flights %>%
group_by(year, month, day) %>%
summarize(mean_delay = mean(arr_delay, na.rm = TRUE)) %>%
sql_render()
## <SQL> SELECT "year", "month", "day", AVG("arr_delay") AS "mean_delay"
## FROM "nycflights13"."flights"
## GROUP BY "year", "month", "day"
50 / 63
flights %>%
group_by(year, month, day) %>%
summarize(mean_delay = mean(arr_delay, na.rm = TRUE)) %>%
collect()
## # A tibble: 365 x 4
## # Groups: year, month [12]
## year month day mean_delay
## <int> <int> <int> <dbl>
## 1 2013 2 9 6.64
## 2 2013 5 18 -10.7
## 3 2013 8 20 -8.21
## 4 2013 2 24 1.09
## 5 2013 4 27 -12.0
## 6 2013 1 31 32.6
## 7 2013 11 15 -4.68
## 8 2013 11 4 3.46
## 9 2013 10 19 -1.08
## 10 2013 12 19 9.01
## # … with 355 more rows
51 / 63
model <- lm(arr_delay ~ factor(origin) + dep_delay, nycflights13::flights)
summary(model)
##
## Call:
## lm(formula = arr_delay ~ factor(origin) + dep_delay, data = nycflights13::flights)
##
## Residuals:
## Min 1Q Median 3Q Max
## -107.429 -11.042 -1.684 8.804 202.748
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.1975535 0.0539274 -114.924 < 2e-16 ***
## factor(origin)JFK -0.5112923 0.0758116 -6.744 1.54e-11 ***
## factor(origin)LGA 1.4919371 0.0773890 19.278 < 2e-16 ***
## dep_delay 1.0196879 0.0007865 1296.428 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.01 on 327342 degrees of freedom
## (9430 observations deleted due to missingness)
## Multiple R-squared: 0.8372, Adjusted R-squared: 0.8372
## F-statistic: 5.612e+05 on 3 and 327342 DF, p-value: < 2.2e-16
52 / 63
library(tidypredict)
sql <- tidypredict_sql(model, con)
sql
## <SQL> -6.19755349842104 + (CASE WHEN ("origin" = 'JFK') THEN (1.0) WHEN NOT("origin" = 'JFK') THEN (0.0) END) * (-0.511292266035181) + (CASE WHEN ("origin" = 'LGA') THEN (1.0) WHEN NOT("origin" = 'LGA') THEN (0.0) END) * (1.49193714231315) + ("dep_delay") * (1.01968794076906)
cat(strwrap(sql), sep = "\n")
## -6.19755349842104 + (CASE WHEN ("origin" = 'JFK') THEN
## (1.0) WHEN NOT("origin" = 'JFK') THEN (0.0) END) *
## (-0.511292266035181) + (CASE WHEN ("origin" = 'LGA') THEN
## (1.0) WHEN NOT("origin" = 'LGA') THEN (0.0) END) *
## (1.49193714231315) + ("dep_delay") * (1.01968794076906)
53 / 63

rquery   Alternative for generating SQL queries

dbx   Useful DML operations (update, upsert, ...)

sqlr   Database-agnostic DDL

arkdb   Database-agnostic import and export

sf   (Support for spatial databases)

bigrquery, civis, MonetDBLite, odbc, RClickhouse, ... (Other backends)

... ?

54 / 63

Consistent

Interface to

Work with databases in R

55 / 63

Thank contributors!

FIXME: Word cloud

56 / 63

via GIPHY

57 / 63

Become a contributor today!

con <- RPostgres::Postgres()@connect()
con@get_query("SELECT 'Hello, world!' AS hi")
58 / 63
60 / 63

Writing and executing

61 / 63
dbWriteTable(con, "spatial_ref_sys", spatial_ref_sys)
dbCreateTable(con, "spatial_ref_sys", spatial_ref_sys)
dbAppendTable(con, "spatial_ref_sys", spatial_ref_sys)
62 / 63
dbExecute(con, "DROP TABLE ...")
dbBegin(con)
dbExecute(con, "UPDATE TABLE ...")
dbExecute(con, "INSERT TABLE ...")
dbCommit(con)
dbWithTransaction(
con,
{
dbExecute(con, "UPDATE TABLE ...")
dbExecute(con, "INSERT TABLE ...")
}
)
63 / 63

(Commercial)

2 / 63
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
Esc Back to slideshow