R users in Swiss public adminis-tration, or institutions with ties to it
Thursday, September 20, 2018
5:15 PM to 7:15 PM
State Secretariat for Economic Affairs
SECO · Holzikofenweg 36 · Bern
Hi! Welcome everybode to my presentation.
Today I'll talk about R and databases.
Before we start, I'd like to conduct a short survey.
A separate server whose only purpose is to store and retrieve data.
... or "an Ess-Cue-Ell query"?
to retrieve some data
Did your code use dbConnect()
?
If you don't know what a schema is, you probably haven't. But I'll talk about it later.
If you don't know what a DBI backend is, you probably haven't.
![]() |
↔
|
|
Thanks. Today's presentation will be about connecting relational databases to R using the DBI package. I'd like to make this presentation useful for you, no matter if you're new to databases or if you use them daily. I'm going to start with some basic concepts, but I'm also including more advanced examples.
library(RPostgres)con <- dbConnect(Postgres())dbGetQuery(con, "SELECT 'Hello, world!' AS hi")
## hi## 1 Hello, world!
dbDisconnect(con)
This talk will use a lot of code examples, please bear with me.
library(RMariaDB)con <- dbConnect(MariaDB())dbGetQuery(con, "SELECT 'Hello, world!' AS hi")
## hi## 1 Hello, world!
dbDisconnect(con)
Mention SQL dialects
library(RSQLite)con <- dbConnect(SQLite())dbGetQuery(con, "SELECT 'Hello, world!' AS hi")
## hi## 1 Hello, world!
dbDisconnect(con)
library(DBI)con <- dbConnect(RSQLite::SQLite())# con <- dbConnect(RMariaDB::MariaDB())# con <- dbConnect(RPostgres::Postgres())dbGetQuery(con, "SELECT 'Hello, world!' AS hi")
## hi## 1 Hello, world!
dbDisconnect(con)
Not only syntactic similarity
Attach DBI instead of individual database packages, only specify database in dbConnect()
Use only functions from DBI to work with the database, only refer to database backends when connecting.
What is a DBI backend?
More packages involved, omitted here
ODBC is covered as part of this framework
Extra packages
DBI has been around for ages, almost as old as R itself.
DBI 0.x had the "syntax" bit right
Subtle differences in semantics between packages
Simplify user's life
Simplify implementer's life
We can't force DBI backends, but we provide a test suite
Project funded by R Consortium
First step when working with a database
Can be frustrating when things don't work, but needs to be done only once
Only thing that is backend-specific: connection parameters
library(DBI)con_sqlite <- dbConnect( RSQLite::SQLite(), path = "...")con_mariadb <- dbConnect( RMariaDB::MariaDB(), default.file = "...")con_postgres <- dbConnect( RPostgres::Postgres(), host = "...", port = "...")dbDisconnect(con_postgres)dbDisconnect(con_mariadb)dbDisconnect(con_sqlite)
library(DBI)con <- dbConnect(RPostgres::Postgres())dbListTables(con)
## [1] "geography_columns" "geometry_columns" "spatial_ref_sys" ## [4] "raster_columns" "raster_overviews" "topology" ## [7] "layer" "geom_test" "temp"
dbListObjects(con)
## table is_prefix## 1 <Id> table = geography_columns FALSE## 2 <Id> table = geometry_columns FALSE## 3 <Id> table = spatial_ref_sys FALSE## 4 <Id> table = raster_columns FALSE## 5 <Id> table = raster_overviews FALSE## 6 <Id> table = topology FALSE## 7 <Id> table = layer FALSE## 8 <Id> table = geom_test FALSE## 9 <Id> table = temp FALSE## 10 <Id> schema = topology TRUE## 11 <Id> schema = my_schema TRUE## 12 <Id> schema = information_schema TRUE## 13 <Id> schema = pg_catalog TRUE## 14 <Id> schema = nycflights13 TRUE## 15 <Id> schema = public TRUE
Schema: like a directory in the file system
flights_obj <- dbListObjects(con, Id(schema = "nycflights13"))flights_obj
## table is_prefix## 1 <Id> schema = nycflights13, table = airlines FALSE## 2 <Id> schema = nycflights13, table = airports FALSE## 3 <Id> schema = nycflights13, table = flights FALSE## 4 <Id> schema = nycflights13, table = planes FALSE## 5 <Id> schema = nycflights13, table = weather FALSE
flights_obj$table[[3]]
## <Id> schema = nycflights13, table = flights
spatial_ref_sys <- dbReadTable(con, "spatial_ref_sys")class(spatial_ref_sys)
## [1] "data.frame"
tibble::as_tibble(spatial_ref_sys)
## # A tibble: 5,435 x 5## srid auth_name auth_srid srtext proj4text ## <int> <chr> <int> <chr> <chr> ## 1 3819 EPSG 3819 "GEOGCS[\"HD1909… "+proj=longlat +e…## 2 3821 EPSG 3821 "GEOGCS[\"TWD67\… "+proj=longlat +e…## 3 3824 EPSG 3824 "GEOGCS[\"TWD97\… "+proj=longlat +e…## 4 3889 EPSG 3889 "GEOGCS[\"IGRS\"… "+proj=longlat +e…## 5 3906 EPSG 3906 "GEOGCS[\"MGI 19… "+proj=longlat +e…## 6 4001 EPSG 4001 "GEOGCS[\"Unknow… "+proj=longlat +e…## 7 4002 EPSG 4002 "GEOGCS[\"Unknow… "+proj=longlat +e…## 8 4003 EPSG 4003 "GEOGCS[\"Unknow… "+proj=longlat +e…## 9 4004 EPSG 4004 "GEOGCS[\"Unknow… "+proj=longlat +e…## 10 4005 EPSG 4005 "GEOGCS[\"Unknow… "+proj=longlat +a…## # ... with 5,425 more rows
This presentation is a "pure" presentation
flights <- dbReadTable( con, Id(schema = "nycflights13", table = "flights"))tibble::as_tibble(flights)
## # A tibble: 336,776 x 19## year month day dep_time sched_dep_time dep_delay arr_time## <int> <int> <int> <int> <int> <dbl> <int>## 1 2013 1 1 517 515 2 830## 2 2013 1 1 533 529 4 850## 3 2013 1 1 542 540 2 923## 4 2013 1 1 544 545 -1 1004## 5 2013 1 1 554 600 -6 812## 6 2013 1 1 554 558 -4 740## 7 2013 1 1 555 600 -5 913## 8 2013 1 1 557 600 -3 709## 9 2013 1 1 557 600 -3 838## 10 2013 1 1 558 600 -2 753## # ... with 336,766 more rows, and 12 more variables:## # sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>,## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,## # time_hour <dttm>
query <- 'SELECT * FROM "nycflights13"."flights" WHERE month = 1'flights <- dbGetQuery(con, query)dim(flights)
## [1] 27004 19
Quoting identifiers and literals
Dynamic SQL
Evil code!
month <- 1query <- paste0( 'SELECT * FROM "nycflights13"."flights" WHERE month = ', month)cat(query)
## SELECT * FROM "nycflights13"."flights" WHERE month = 1
month <- 1query <- paste0( 'SELECT * FROM "nycflights13"."flights" WHERE month = ', dbQuoteLiteral(con, month))cat(query)
## SELECT * FROM "nycflights13"."flights" WHERE month = 1::float8
month_var <- "month"month <- 1query <- paste0( 'SELECT * FROM "nycflights13"."flights" WHERE ', month_var, ' = ', dbQuoteLiteral(con, month))cat(query)
## SELECT * FROM "nycflights13"."flights" WHERE month = 1::float8
month_var <- "month"month <- 1query <- paste0( 'SELECT * FROM "nycflights13"."flights" WHERE ', dbQuoteIdentifier(con, month_var), ' = ', dbQuoteLiteral(con, month))cat(query)
## SELECT * FROM "nycflights13"."flights" WHERE "month" = 1::float8
table_name <- Id(schema = "nycflights13", table = "flights")month_var <- "month"month <- 1query <- paste0( 'SELECT * FROM ', dbQuoteIdentifier(con, table_name), ' WHERE ', dbQuoteIdentifier(con, month_var), ' = ', dbQuoteLiteral(con, month))cat(query)
## SELECT * FROM "nycflights13"."flights" WHERE "month" = 1::float8
flights <- dbGetQuery(con, query)dim(flights)
## [1] 27004 19
Much more interesting when quoting strings, datetime, ...
table_name <- Id(schema = "nycflights13", table = "flights")month_var <- "month"month <- 1query <- paste0( 'SELECT * FROM ', dbQuoteIdentifier(con, table_name), ' WHERE ', dbQuoteIdentifier(con, month_var), ' = $1')cat(query)
## SELECT * FROM "nycflights13"."flights" WHERE "month" = $1
flights <- dbGetQuery( con, query, params = list(1))dim(flights)
## [1] 27004 19
library(dplyr)library(dbplyr)src <- src_dbi(con)flights <- tbl(con, ident_q('"nycflights13"."flights"'))flights
## # Source: table<"nycflights13"."flights"> [?? x 19]## # Database: postgres [kirill@/var/run/postgresql:5432/kirill]## year month day dep_time sched_dep_time dep_delay arr_time## <int> <int> <int> <int> <int> <dbl> <int>## 1 2013 1 1 517 515 2 830## 2 2013 1 1 533 529 4 850## 3 2013 1 1 542 540 2 923## 4 2013 1 1 544 545 -1 1004## 5 2013 1 1 554 600 -6 812## 6 2013 1 1 554 558 -4 740## 7 2013 1 1 555 600 -5 913## 8 2013 1 1 557 600 -3 709## 9 2013 1 1 557 600 -3 838## 10 2013 1 1 558 600 -2 753## # ... with more rows, and 12 more variables:## # sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>,## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,## # time_hour <dttm>
flights %>% select(year, month, day, ends_with("delay")) %>% filter(month == 2)
## # Source: lazy query [?? x 5]## # Database: postgres [kirill@/var/run/postgresql:5432/kirill]## year month day dep_delay arr_delay## <int> <int> <int> <dbl> <dbl>## 1 2013 2 1 -4 4## 2 2013 2 1 -5 -4## 3 2013 2 1 -3 8## 4 2013 2 1 -8 -10## 5 2013 2 1 0 9## 6 2013 2 1 -8 -1## 7 2013 2 1 -8 9## 8 2013 2 1 -8 -14## 9 2013 2 1 -7 18## 10 2013 2 1 -7 -4## # … with more rows
flights %>% group_by(year, month, day) %>% summarize(mean(arr_delay))
## Warning: Missing values are always removed in SQL.## Use `AVG(x, na.rm = TRUE)` to silence this warning
## # Source: lazy query [?? x 4]## # Database: postgres [kirill@/var/run/postgresql:5432/kirill]## # Groups: year, month## year month day `mean(arr_delay)`## <int> <int> <int> <dbl>## 1 2013 1 1 12.7 ## 2 2013 1 2 12.7 ## 3 2013 1 3 5.73 ## 4 2013 1 4 -1.93 ## 5 2013 1 5 -1.53 ## 6 2013 1 6 4.24 ## 7 2013 1 7 -4.95 ## 8 2013 1 8 -3.23 ## 9 2013 1 9 -0.264## 10 2013 1 10 -5.90 ## # … with more rows
flights %>% group_by(year, month, day) %>% summarize(mean_delay = mean(arr_delay, na.rm = TRUE))
## # Source: lazy query [?? x 4]## # Database: postgres [kirill@/var/run/postgresql:5432/kirill]## # Groups: year, month## year month day mean_delay## <int> <int> <int> <dbl>## 1 2013 1 1 12.7 ## 2 2013 1 2 12.7 ## 3 2013 1 3 5.73 ## 4 2013 1 4 -1.93 ## 5 2013 1 5 -1.53 ## 6 2013 1 6 4.24 ## 7 2013 1 7 -4.95 ## 8 2013 1 8 -3.23 ## 9 2013 1 9 -0.264## 10 2013 1 10 -5.90 ## # … with more rows
flights %>% group_by(year, month, day) %>% summarize(mean_delay = mean(arr_delay, na.rm = TRUE)) %>% sql_render()
## <SQL> SELECT "year", "month", "day", AVG("arr_delay") AS "mean_delay"## FROM "nycflights13"."flights"## GROUP BY "year", "month", "day"
flights %>% group_by(year, month, day) %>% summarize(mean_delay = mean(arr_delay, na.rm = TRUE)) %>% collect()
## # A tibble: 365 x 4## # Groups: year, month [12]## year month day mean_delay## <int> <int> <int> <dbl>## 1 2013 2 9 6.64## 2 2013 5 18 -10.7 ## 3 2013 8 20 -8.21## 4 2013 2 24 1.09## 5 2013 4 27 -12.0 ## 6 2013 1 31 32.6 ## 7 2013 11 15 -4.68## 8 2013 11 4 3.46## 9 2013 10 19 -1.08## 10 2013 12 19 9.01## # … with 355 more rows
model <- lm(arr_delay ~ factor(origin) + dep_delay, nycflights13::flights)summary(model)
## ## Call:## lm(formula = arr_delay ~ factor(origin) + dep_delay, data = nycflights13::flights)## ## Residuals:## Min 1Q Median 3Q Max ## -107.429 -11.042 -1.684 8.804 202.748 ## ## Coefficients:## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -6.1975535 0.0539274 -114.924 < 2e-16 ***## factor(origin)JFK -0.5112923 0.0758116 -6.744 1.54e-11 ***## factor(origin)LGA 1.4919371 0.0773890 19.278 < 2e-16 ***## dep_delay 1.0196879 0.0007865 1296.428 < 2e-16 ***## ---## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1## ## Residual standard error: 18.01 on 327342 degrees of freedom## (9430 observations deleted due to missingness)## Multiple R-squared: 0.8372, Adjusted R-squared: 0.8372 ## F-statistic: 5.612e+05 on 3 and 327342 DF, p-value: < 2.2e-16
library(tidypredict)sql <- tidypredict_sql(model, con)sql
## <SQL> -6.19755349842104 + (CASE WHEN ("origin" = 'JFK') THEN (1.0) WHEN NOT("origin" = 'JFK') THEN (0.0) END) * (-0.511292266035181) + (CASE WHEN ("origin" = 'LGA') THEN (1.0) WHEN NOT("origin" = 'LGA') THEN (0.0) END) * (1.49193714231315) + ("dep_delay") * (1.01968794076906)
cat(strwrap(sql), sep = "\n")
## -6.19755349842104 + (CASE WHEN ("origin" = 'JFK') THEN## (1.0) WHEN NOT("origin" = 'JFK') THEN (0.0) END) *## (-0.511292266035181) + (CASE WHEN ("origin" = 'LGA') THEN## (1.0) WHEN NOT("origin" = 'LGA') THEN (0.0) END) *## (1.49193714231315) + ("dep_delay") * (1.01968794076906)
Thank contributors!
FIXME: Word cloud
Become a contributor today!
con <- RPostgres::Postgres()@connect()con@get_query("SELECT 'Hello, world!' AS hi")
dbWriteTable(con, "spatial_ref_sys", spatial_ref_sys)dbCreateTable(con, "spatial_ref_sys", spatial_ref_sys)dbAppendTable(con, "spatial_ref_sys", spatial_ref_sys)
dbExecute(con, "DROP TABLE ...")dbBegin(con)dbExecute(con, "UPDATE TABLE ...")dbExecute(con, "INSERT TABLE ...")dbCommit(con)dbWithTransaction( con, { dbExecute(con, "UPDATE TABLE ...") dbExecute(con, "INSERT TABLE ...") })
Keyboard shortcuts
↑, ←, Pg Up, k | Go to previous slide |
↓, →, Pg Dn, Space, j | Go to next slide |
Home | Go to first slide |
End | Go to last slide |
Number + Return | Go to specific slide |
b / m / f | Toggle blackout / mirrored / fullscreen mode |
c | Clone slideshow |
p | Toggle presenter mode |
t | Restart the presentation timer |
?, h | Toggle this help |
Esc | Back to slideshow |