Streaming machine learning with RMOA: stream_in > train > predict

We will be showcasing our RMOA package at the next R User conference in Aalborg.
For R users who are unfamiliar with streaming modelling and want to be ahead of the Gartner Hype Cycle, or who want to evaluate existing streaming machine learning models, RMOA makes it possible to build, run and evaluate streaming classification models that are implemented in MOA (Massive Online Analysis).
For an introduction to RMOA and MOA and the types of machine learning models available in MOA, see our previous blog post or scroll through our blog page.

In the example below, we showcase the RMOA package using streaming JSON data, which can come from any NoSQL database that outputs JSON. The jsonlite package provides a nice stream_in function (an example is shown here) that handles streaming JSON data. Plugging in streaming machine learning models with RMOA is a breeze.
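
To make the chunked processing concrete, here is a minimal sketch of how stream_in hands pages of records to a handler. It assumes newline-delimited JSON (one record per line), which is the format stream_in reads. The three inline records and the `seen` counter are made up for illustration and stand in for a real database export.

```r
library(jsonlite)

## Three made-up records in newline-delimited JSON (one object per line)
ndjson <- c('{"cut":"Ideal","depth":61.5}',
            '{"cut":"Very Good","depth":62.8}',
            '{"cut":"Premium","depth":59.8}')

seen <- 0  ## hypothetical counter of the rows processed so far
con <- textConnection(ndjson)
stream_in(
  con = con,
  handler = function(chunk){
    ## Each chunk arrives as a data.frame with at most 'pagesize' rows
    seen <<- seen + nrow(chunk)
  },
  pagesize = 2,
  verbose = FALSE)
close(con)  ## we opened this connection ourselves, so we close it ourselves
seen
```

With a pagesize of 2 the handler is called twice here, once with two rows and once with the remaining row. The same pattern applies when the handler trains or scores a model instead of counting rows.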

Let's dive straight into the R code, which shows how to build, run and evaluate a streaming boosted classification model.

require(jsonlite)
require(data.table)
require(RMOA)
require(ROCR)
##
## Use a dataset from Jeroen Ooms available at jeroenooms.github.io/data/diamonds.json
##
myjsondataset <- url("http://jeroenooms.github.io/data/diamonds.json")
datatransfo <- function(x){
  ## Setting the target to predict
  x$target <- factor(ifelse(x$cut == "Very Good", "Very Good", "Other"), levels = c("Very Good", "Other"))
  ## Making sure the levels are the same across all streaming chunks
  x$color <- factor(x$color, levels = c("D", "E", "F", "G", "H", "I", "J"))
  x  
}

##
## Read 100 lines of an example dataset to see what it looks like
##
x <- readLines(myjsondataset, n = 100, encoding = "UTF-8")
x <- rbindlist(lapply(x, fromJSON))
x <- datatransfo(x)
str(x)

######################################
## Boosted streaming classification
##   - set up the boosting options
######################################
ctrl <- MOAoptions(model = "OCBoost", randomSeed = 123456789, ensembleSize = 25,
                   smoothingParameter = 0.5)
mymodel <- OCBoost(control = ctrl)
mymodel
## Train an initial model on 100 rows of the data
myboostedclassifier <- trainMOA(model = mymodel, 
         formula = target ~ color + depth + x + y + z,
         data = datastream_dataframe(x))

## Update the model iteratively with streaming data
stream_in(
  con = myjsondataset,
  handler = function(x){
    x <- datatransfo(x)
    ## Update the trained model with the new chunk
    ## (<<- makes the updated model available outside the handler)
    myboostedclassifier <<- trainMOA(model = myboostedclassifier$model, 
             formula = target ~ color + depth + x + y + z,
             data = datastream_dataframe(x), 
             reset = FALSE) ## do not reset what the model has learned already
  },
  pagesize = 500)

## Do some prediction to test the model
predict(myboostedclassifier, x)
table(sprintf("Reality: %s", x$target),
      sprintf("Predicted: %s", predict(myboostedclassifier, x)))

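## Do a streaming prediction
## (use a fresh connection: stream_in closes a connection it had to open itself)
stream_in(con = url("http://jeroenooms.github.io/data/diamonds.json"),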
          handler = function(x){
            x <- datatransfo(x)
            myprediction <- predict(myboostedclassifier, x)
            ## Basic evaluation by extracting accuracy
            print(round(sum(myprediction == x$target) / length(myprediction), 2))
          },
          pagesize = 100)
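
The handler above prints one accuracy figure per chunk. To get a single overall figure, the correct and total counts can be accumulated across chunks. The sketch below uses only base R. `make_accuracy_tracker` and the two toy chunks are hypothetical names and data introduced for illustration; they are not part of RMOA or jsonlite.

```r
## Hypothetical helper: keeps running counts across chunks
make_accuracy_tracker <- function(){
  correct <- 0
  total <- 0
  list(
    update = function(predicted, actual){
      ## Compare as character so the two inputs may be factors or strings
      correct <<- correct + sum(as.character(predicted) == as.character(actual))
      total <<- total + length(actual)
      invisible(correct / total)
    },
    accuracy = function() correct / total
  )
}

tracker <- make_accuracy_tracker()
## Two toy chunks standing in for predict() output and x$target
tracker$update(c("Other", "Other", "Very Good"), c("Other", "Very Good", "Very Good"))
tracker$update(c("Other", "Other"), c("Other", "Other"))
tracker$accuracy()  ## 4 correct out of 5 observations
```

Inside the stream_in handler, the print statement would then become a call such as `tracker$update(myprediction, x$target)`. Keep in mind that the stream scored here is the same one the model was trained on, so the resulting accuracy is likely to be optimistic.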

For more information on RMOA or streaming modelling, get in touch.