dokstok

Dokstok is an R package which contains utility functions to work more easily with PL/R. PL/R is a procedural language which runs on top of PostgreSQL/GreenPlum/Apache HAWQ.

Why

This R package was set up with the following things in mind

  • Simplification of database transfer & using copy to speed up transfer data in and out to the database
  • Simplify storing any R objects inside a PostgreSQL/Greenplum/HAWQ database. So that
    • multiple users can access R objects from 1 place instead of communicating over a shared network drive
    • as R objects are stored inside the database, R objects can be backed up by standard database procedures
  • R scripts, R code and functions can be launched at the database server from your local R session, leveraging the capacity of the database environment. The results can be returned to your local R session or kept in the database.
  • R models which are developped locally can be stored in the database and easily deployed as stored procedures. Functionality exists to make this transition more easy.
  • As PL/R functionality works for PostgreSQL, GreenPlum and Apache HAWQ, one can easily scale out.

Current features

  • Functions to store any R objects inside the database. dokstok, dokstok_pull, dokstok_ls, dokstok_rm
  • PL/R functions to evaluate R code inside the database directly: plr_eval
  • Easy functions to work with the database: dbfetch, dbquery, db_create_table, list_databases, list_schema, list_tables
  • Fast reading and writing using COPY copy_in, copy_to
  • Easy connection setup: dokstok_connect, dokstok_defaultconnection, dokstok_disconnect. Set once use everywhere.

Examples

Example on storing objects inside the database

dokstok_connect()
## Save model inside database
mymodel <- lm(Sepal.Length ~ Sepal.Width, data = iris)
obj <- dokstok(mymodel)
obj

## See what is in the database + pull the object back locally
dokstok_ls()
m <- dokstok_pull(obj)
summary(m)

Example on running an R script at the database server

myfun <- function(){
  result <- list()
  result$pid <- Sys.getpid()
  result$liblocs = .libPaths()
  result$pkgs = rownames(installed.packages())
  result$searchpath = search()
  result$env = Sys.getenv()
  result$ls = ls(envir = .GlobalEnv)
  result
}
plr_eval(FUN=myfun, pull = TRUE)

Setup

  • Make sure you install the PL/R extension on the database environment
## E.g. with PostgreSQL
sudo apt-get update
sudo apt-get install postgresql-9.3-plr
  • Install the R package locally and on the database server
install.packages(c("DBI", "RPostgreSQL", "digest", "jsonlite", "data.table"))
install.packages("plr.utils", repos = "http://www.datatailor.be/rcube", type = "source")
  • Start using the package

Create an environment variable DOKSTOK_CON with the path to a file where to store credentials

## You can also set the default connection which will be use when connecting without arguments
credentials_file <- tempfile(pattern = "dokstok", tmpdir = getwd(), fileext = "")
Sys.setenv(DOKSTOK_CON = credentials_file)
dokstok_defaultconnection(
  dbname = "name_of_the_db", 
  host = "localhost", 
  user = "name_of_the_user", 
  password = "password_of_the_user", 
  port = 5432)
  • Setup PL/R extension from R

The following will CREATE EXTENSION plr, create stored procedures, plr_host, plr_require, plr_as_bytea and plr_eval

plrutils_start(fresh = TRUE)
dokstok_init(table='robjects', schema='rstore')

Support

Need support? Contact BNOSAC: http://www.bnosac.be