dokstok - enterprise platform for storing and sharing predictive models

Dokstok is a tool that lets you store R models, and any other R object, inside a PostgreSQL or Hadoop-based database so that they can be easily reused across the organisation. Storing models in a database makes it easy to set up backup procedures and to trace changes to models, while letting everyone share the models and use them for integrated predictive modelling. Data scientists can store models in a traditional PostgreSQL database or in a NoSQL Hadoop storage backend, either on premise or in the cloud.


dokstok features

 
 

Store any R models / objects / functions

Easily store any R object inside your database with a simple dokstok_pull and dokstok_push, so that you can share models across the organisation through a database instead of a file system.

 

Big Data and small data

dokstok currently works with PostgreSQL, Greenplum or Apache HAWQ as the backend, allowing users to easily share any R object either in a regular SQL database or in a Big Data Hadoop distributed store.

 

Use the database for security and backup

R objects stored in the database inherit the security settings of the database. Use database procedures to easily back up all your R data.

 

Integrated predictive modelling

Dokstok comes with extra facilities that make it easier for R users to work with PL/R. This allows R users to easily use the models.

 

Launch R code locally, run it on the database server

dokstok provides functionality that lets R users send R code from their local session to be run inside the database environment, leveraging the shared high-performance database architecture and limiting the data movement and delays caused by copying data to your local desktop.

 

Open Source

Dokstok is open source. BNOSAC offers support in setting up the tool and provides training on how to use it efficiently.

dokstok

Dokstok is an R package which contains utility functions to work more easily with PL/R. PL/R is a procedural language which runs on top of PostgreSQL/Greenplum/Apache HAWQ.

Why

This R package was set up with the following goals in mind:

  • Simplify data transfer and use COPY to speed up moving data in and out of the database
  • Simplify storing any R object inside a PostgreSQL/Greenplum/HAWQ database, so that
    • multiple users can access R objects from one place instead of communicating over a shared network drive
    • as R objects are stored inside the database, R objects can be backed up by standard database procedures
  • R scripts, R code and functions can be launched at the database server from your local R session, leveraging the capacity of the database environment. The results can be returned to your local R session or kept in the database.
  • R models which are developed locally can be stored in the database and easily deployed as stored procedures. Functionality exists to make this transition easier.
  • As PL/R functionality works for PostgreSQL, Greenplum and Apache HAWQ, one can easily scale out.

Current features

  • Functions to store any R object inside the database: dokstok, dokstok_pull, dokstok_ls, dokstok_rm
  • PL/R functions to evaluate R code inside the database directly: plr_eval
  • Easy functions to work with the database: dbfetch, dbquery, db_create_table, list_databases, list_schema, list_tables
  • Fast reading and writing using COPY: copy_in, copy_to
  • Easy connection setup: dokstok_connect, dokstok_defaultconnection, dokstok_disconnect. Set once, use everywhere.

Examples

Example on storing objects inside the database

dokstok_connect()
## Save model inside database
mymodel <- lm(Sepal.Length ~ Sepal.Width, data = iris)
obj <- dokstok(mymodel)
obj

## See what is in the database + pull the object back locally
dokstok_ls()
m <- dokstok_pull(obj)
summary(m)
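
As a rough illustration of why any R object can live in a database column, base R can turn an object into a raw byte vector and back. The sketch below uses only serialize and unserialize from base R; whether dokstok uses exactly this mechanism internally is an assumption and is not stated on this page.

```r
## Fit the same small model as in the example above
mymodel <- lm(Sepal.Length ~ Sepal.Width, data = iris)

## Serialize the model to a raw vector, the kind of payload a binary column can hold
payload <- serialize(mymodel, connection = NULL)
is.raw(payload)
length(payload)  ## size of the serialized object in bytes

## Restore the object from the raw vector and check that the coefficients match
restored <- unserialize(payload)
stopifnot(isTRUE(all.equal(coef(mymodel), coef(restored))))
```

The round trip keeps the class of the object, so functions such as summary or predict can be used on the restored model as on the original.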

Example on running an R script at the database server

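## Collect some diagnostics about the R session in which the function runs
myfun <- function() {
  result <- list()
  result$pid <- Sys.getpid()
  result$liblocs <- .libPaths()
  result$pkgs <- rownames(installed.packages())
  result$searchpath <- search()
  result$env <- Sys.getenv()
  result$ls <- ls(envir = .GlobalEnv)
  result
}
## Run the function at the database server and pull the result back locally
plr_eval(FUN = myfun, pull = TRUE)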
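
Before sending a function to the server, it can help to run it locally and look at the shape of what it returns. The sketch below simply calls the same function in the local session; no dokstok function is involved, and the values describe your desktop rather than the database host.

```r
## Same diagnostic function as above, repeated so that this sketch runs on its own
myfun <- function() {
  result <- list()
  result$pid <- Sys.getpid()
  result$liblocs <- .libPaths()
  result$pkgs <- rownames(installed.packages())
  result$searchpath <- search()
  result$env <- Sys.getenv()
  result$ls <- ls(envir = .GlobalEnv)
  result
}

## Call it locally and inspect the top-level structure of the result
local_result <- myfun()
names(local_result)
str(local_result, max.level = 1)
```

Comparing this local output with the result pulled from the server is one way to spot differences in library paths or installed packages between the two environments.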

Setup

  • Make sure you install the PL/R extension on the database environment
## E.g. with PostgreSQL
sudo apt-get update
sudo apt-get install postgresql-9.3-plr
  • Install the R package locally and on the database server
install.packages(c("DBI", "RPostgreSQL", "digest", "jsonlite", "data.table"))
install.packages("plr.utils", repos = "http://www.datatailor.be/rcube", type = "source")
  • Start using the package

Create an environment variable DOKSTOK_CON containing the path to the file in which the credentials are stored

## You can also set the default connection, which will be used when connecting without arguments
credentials_file <- tempfile(pattern = "dokstok", tmpdir = getwd(), fileext = "")
Sys.setenv(DOKSTOK_CON = credentials_file)
dokstok_defaultconnection(
  dbname = "name_of_the_db", 
  host = "localhost", 
  user = "name_of_the_user", 
  password = "password_of_the_user", 
  port = 5432)
  • Set up the PL/R extension from R

The following will run CREATE EXTENSION plr and create the stored procedures plr_host, plr_require, plr_as_bytea and plr_eval

plrutils_start(fresh = TRUE)
dokstok_init(table='robjects', schema='rstore')
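
Since the file referenced by DOKSTOK_CON in the steps above holds a password, it may be worth checking that the variable is set and tightening the permissions on that file. The sketch below uses base R only. The file location is a hypothetical placeholder and the permission mode is a suggestion, not a dokstok requirement.

```r
## Hypothetical location, created here only so that the sketch runs on its own;
## in practice this is the file you pointed DOKSTOK_CON at
credentials_file <- file.path(tempdir(), "dokstok_credentials")
file.create(credentials_file)
Sys.setenv(DOKSTOK_CON = credentials_file)

## Confirm the environment variable is visible to the current R session
con_file <- Sys.getenv("DOKSTOK_CON", unset = NA)
if (is.na(con_file) || !nzchar(con_file)) stop("DOKSTOK_CON is not set")

## Restrict the file to its owner (mainly meaningful on Unix-like systems)
Sys.chmod(con_file, mode = "0600")
format(file.info(con_file)$mode)
```

To make the variable available in every session, it can also be set in a .Renviron file, which R reads at startup.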

Support

Need support? Contact BNOSAC: http://www.bnosac.be