RMOA: Massive online data stream classifications with R & MOA

For those of you who don't know MOA: it stands for Massive On-line Analysis and is an open-source framework that allows you to build and run machine learning or data mining experiments on evolving data streams. According to the MOA website (http://moa.cms.waikato.ac.nz), it contains machine learning algorithms for classification, regression, clustering, outlier detection and recommendation engines.
For R users who work with a lot of data or run into RAM issues when building models on large datasets, MOA, and data streams in general, have some nice features. Namely, a data stream algorithm:
  1. Uses a limited amount of memory, so there are no RAM issues when building models.
  2. Processes one example at a time and looks at each example only once.
  3. Works incrementally, so the model is ready to be used for prediction at any time.
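To make these three properties concrete, here is a minimal sketch in base R. It is not MOA code: it uses Welford's running mean/variance update as a stand-in for a streaming learner, and the names update_stats and stream are made up for this illustration.

```r
## One-pass, incremental estimate of mean and variance (Welford's update).
## Only three numbers are kept in memory, however long the stream is.
update_stats <- function(state, x) {
  n     <- state$n + 1
  delta <- x - state$mean
  mean  <- state$mean + delta / n
  m2    <- state$m2 + delta * (x - mean)
  list(n = n, mean = mean, m2 = m2)
}

state  <- list(n = 0, mean = 0, m2 = 0)
stream <- c(5.1, 4.9, 4.7, 4.6, 5.0)   # stand-in for examples arriving one by one
for (x in stream) {
  ## each example is processed once and then discarded;
  ## the summary in `state` is usable after every update
  state <- update_stats(state, x)
}
running_mean <- state$mean
running_var  <- state$m2 / (state$n - 1)
```

The streaming classifiers in MOA follow the same pattern, only with a richer state (for example the sufficient statistics in the leaves of a tree) instead of three numbers.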


Unfortunately, MOA is written in Java and is not easily accessible to R users. For users mostly interested in clustering, the stream package already facilitates this (this blog item gave an example of using ff alongside the stream package). In our day-to-day use cases, however, classification is a more common request, and the stream package only supports clustering. Hence the decision to make the classification algorithms of MOA easily available to R users as well. For this, the RMOA package was created; it is available on github (https://github.com/jwijffels/RMOA).
The current features of RMOA are:
  1. Easy to set up data streams on data in RAM (data.frame/matrix), data in files (csv, delimited, flat table) as well as out-of-memory data in an ffdf (ff package).
  2. Easy to set up a MOA classification model 
  3. There are 26 classification models available which range from
    1. Classification Trees (AdaHoeffdingOptionTree, ASHoeffdingTree, DecisionStump, HoeffdingAdaptiveTree, HoeffdingOptionTree, HoeffdingTree, LimAttHoeffdingTree, RandomHoeffdingTree)
    2. Bayes Rule (NaiveBayes, NaiveBayesMultinomial)
    3. Ensemble learning
      • Bagging (LeveragingBag, OzaBag, OzaBagAdwin, OzaBagASHT)
      • Boosting (OCBoost, OzaBoost, OzaBoostAdwin)
      • Stacking (LimAttClassifier)
      • Other (AccuracyUpdatedEnsemble, AccuracyWeightedEnsemble, ADACC, DACC, OnlineAccuracyUpdatedEnsemble, TemporallyAugmentedClassifier, WeightedMajorityAlgorithm)
    4. Active learning (ActiveClassifier)
  4. Easy interface to train the model on streaming data, using the formula interface familiar to R users, as in trainMOA(model, formula, data, subset, na.action = na.exclude, ...)
  5. Easy to score new data with the model, as in predict(object, newdata, type = "response", ...)
Feel free to use it. To help improve the package, we welcome any feedback on your day-to-day experience with RMOA at https://github.com/jwijffels/RMOA. For more documentation on MOA for R users, see http://jwijffels.github.io/RMOA/
An example of R code which constructs a HoeffdingTree and a boosted set of HoeffdingTrees is shown below.
## Installation from github (install_github comes from the devtools package)
library(devtools)
install_github("jwijffels/RMOA", subdir="RMOAjars/pkg")
install_github("jwijffels/RMOA", subdir="RMOA/pkg")

## HoeffdingTree example
library(RMOA)
hdt <- HoeffdingTree(numericEstimator = "GaussianNumericAttributeClassObserver")
## Define a stream - e.g. a stream based on a data.frame
iris <- factorise(iris)
irisdatastream <- datastream_dataframe(data=iris)
## Train the HoeffdingTree on the iris dataset
mymodel <- trainMOA(model = hdt, 
  formula = Species ~ Sepal.Length + Sepal.Width + Petal.Length, 
  data = irisdatastream)
## Predict using the HoeffdingTree on the iris dataset
scores <- predict(mymodel, newdata=iris, type="response")
table(scores, iris$Species)
scores <- predict(mymodel, newdata=iris, type="votes")

## Boosted set of HoeffdingTrees
irisdatastream <- datastream_dataframe(data=iris)
mymodel <- OzaBoost(baseLearner = "trees.HoeffdingTree", ensembleSize = 30)
mymodel <- trainMOA(model = mymodel, 
  formula = Species ~ Sepal.Length + Sepal.Width + Petal.Length, 
  data = irisdatastream)
## Predict 
scores <- predict(mymodel, newdata=iris, type="response")
table(scores, iris$Species)
scores <- predict(mymodel, newdata=iris, type="votes")
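
The table() calls above give a confusion matrix of predicted versus observed classes. As a small sketch of how to turn such a table into an accuracy figure, the base R code below uses made-up label vectors (predicted and observed are hypothetical stand-ins for scores and iris$Species, not output of RMOA).

```r
## Hypothetical labels standing in for `scores` and `iris$Species`
observed  <- factor(c("setosa", "setosa", "versicolor", "virginica", "virginica"))
predicted <- factor(c("setosa", "setosa", "versicolor", "versicolor", "virginica"),
                    levels = levels(observed))

## Rows are predictions, columns are observed classes
confusion <- table(predicted, observed)

## Proportion of examples on the diagonal, i.e. correctly classified
accuracy <- sum(diag(confusion)) / sum(confusion)
```

Giving both factors the same levels keeps the table square, also when the model never predicts one of the classes; otherwise diag() would not line up with the correct cells.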

Air quality, weather analysis & latent feature modelling @RBelgium

In two weeks, on Thursday, March 20, the RBelgium R user group holds its next regular meeting in Brussels. This is the schedule:

** Analysis and visualisation of climate data from the atmospheric model ALADIN using the Rfa package! (Rozemien De Troch - Onderzoeksdepartement KMI)

** Probabilistic latent feature analysis with the plfm package (Michel Meulders - Centre for Information Management, Modeling and Simulation, KU Leuven @ HUBrussel)
** AiR Quality Monitoring – An alternative way for data analysis and visualization (Spanu Laurent & Lenartz Fabian - Institut scientifique de service public)

For more information about the event follow this link. Feel free to join.

Advanced R programming topics + RApache course

Advanced R programming topics

As last year, BNOSAC is offering the short course 'Advanced R programming topics' at the Leuven Statistics Research Center (Belgium).

The course is now part of FLAMES (Flanders Training Network for Methodology and Statistics) and can be found here: http://www.flames-statistics.eu/training/advanced-r-programming-topics. Subscription is no longer possible unless you kindly ask LStat.

RApache and developing web applications with R as backend

As the demand for courses on R is increasing, we are also thinking about giving a course on RApache and developing web applications with R as a backend. This course will allow you to build applications like this one http://rweb.stat.ucla.edu/lme4/ or this one http://rweb.stat.ucla.edu/ggplot2/.

BNOSAC has quite a few (private) business applications running on this technology stack and would like to share its knowledge with you. If you are interested in these courses, which combine javascript, R and RApache, get in contact with us by filling out the form at index.php/contact/get-in-touch. The more people interested, the lower the cost of the course.



Connect R with Myrrix - Mahout & Cloudera's real-time, scalable recommender system

Myrrix is probably better known among Java developers and users of Mahout than among R users. This is because Java and R developers mostly live in different communities.

If you go to the website of Myrrix (http://myrrix.com), you'll find that it is a large-scale recommender system which builds a recommendation model based on Alternating Least Squares. If you tune it well enough, that technique is a pretty good benchmark model for getting recommendations to your customers.
It has a setup which allows you to build recommendation models with local data and a setup to build a recommender system based on data in Hadoop, be it on CDH or on another Hadoop stack like HDInsight or your own installation.
Very recently, Cloudera announced its intention to incorporate Myrrix into its product offering (see this press release), and this is getting quite some attention.
Recommendation engines are one of the machine learning techniques which get frequent attention, although they are not used as frequently as other statistical techniques like classification or regression.
This is because recommendation engines usually require a lot of processing, like deciding on which data to use, handling time-based information, handling new products and products which are no longer sold, making sure the model is up-to-date and so forth.
When setting up a recommendation engine, business users also want to compare its behaviour to other business-driven or data-driven logic. In these initial phases of a project, allowing statisticians and data scientists to use their language of choice to communicate with, test and evaluate the recommendation engine is key.
To allow this, we have created an interface between R and Myrrix, consisting of two packages which are currently available on github (https://github.com/jwijffels/Myrrix-R-interface). It allows R users to build, fine-tune and evaluate the recommendation engine as well as retrieve recommendations. Future users of Cloudera might be interested in this as well, once Myrrix gets incorporated into their product offering.
Myrrix deploys a recommender engine technique based on large, sparse matrix factorization. From input data, it learns a small number of "features" that best explain users' and items' observed interactions. This same basic idea goes by many names in machine learning, like principal component analysis or latent factor analysis. Myrrix uses a modified version of an Alternating Least Squares algorithm to factor matrices. More information can be found here: http://www.slideshare.net/srowen/big-practical-recommendations-with-alternating-least-squares and at the Myrrix website.
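To give an idea of what factoring a user/item matrix with Alternating Least Squares means, here is a toy sketch in base R. It is a plain ridge-penalised ALS on a small dense matrix of made-up numbers, not the modified algorithm Myrrix uses, and all names in it (R, U, V, k, lambda, loss) are chosen for this illustration only.

```r
## Toy ALS: approximate R (users x items) by U %*% t(V) with k latent features
set.seed(42)
n_users <- 6; n_items <- 5
k <- 2          # number of latent features
lambda <- 0.1   # ridge penalty
R <- matrix(runif(n_users * n_items), n_users, n_items)  # made-up interaction strengths
U <- matrix(rnorm(n_users * k), n_users, k)              # user features
V <- matrix(rnorm(n_items * k), n_items, k)              # item features

## Penalised squared reconstruction error
loss <- function(R, U, V, lambda) {
  sum((R - U %*% t(V))^2) + lambda * (sum(U^2) + sum(V^2))
}

losses <- loss(R, U, V, lambda)
for (iter in 1:10) {
  ## keep the item features fixed and solve a ridge regression for the users
  U <- R %*% V %*% solve(crossprod(V) + lambda * diag(k))
  ## keep the user features fixed and solve a ridge regression for the items
  V <- t(R) %*% U %*% solve(crossprod(U) + lambda * diag(k))
  losses <- c(losses, loss(R, U, V, lambda))
}
predicted <- U %*% t(V)   # estimated preference for every user/item pair
```

Each half-step is an exact least squares solution, so the penalised loss never increases from one iteration to the next. The number of features, the penalty and the number of iterations are the kind of knobs you tune when fitting such a model.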
So if you are interested in setting up a recommendation engine for your application or if you want to improve your existing recommendation toolkit, contact us.
If you are an R user and only interested in the code on how to build a recommendation model and retrieve recommendations, here it is. The packages will be pushed to CRAN soon.
# To start building recommendation engines, install the R packages Myrrixjars and Myrrix as follows.
library(devtools)
install_github("jwijffels/Myrrix-R-interface", subdir="Myrrixjars/pkg")
install_github("jwijffels/Myrrix-R-interface", subdir="Myrrix/pkg")
library(Myrrix)

## The following example shows the basic usage on how to use Myrrix to build a local recommendation 
## engine. It uses the audioscrobbler data available on the Myrrix website.

## Download example dataset
inputfile <- file.path(tempdir(), "audioscrobbler-data.subset.csv.gz")
download.file(url="http://dom2bevkhhre1.cloudfront.net/audioscrobbler-data.subset.csv.gz", destfile = inputfile)

## Set hyperparameters
setMyrrixHyperParameters(params=list(model.iterations.max = 2, model.features=10, model.als.lambda=0.1))
x <- getMyrrixHyperParameters(parameters=c("model.iterations.max","model.features","model.als.lambda"))

## Build a model which will be stored in getwd() and ingest the data file into it
recommendationengine <- new("ServerRecommender", localInputDir=getwd())
ingest(recommendationengine, inputfile)

## Get all users/items and score alongside the recommendation model
items <- getAllItemIDs(recommendationengine)
users <- getAllUserIDs(recommendationengine)
estimatePreference(recommendationengine, userID=users[1], itemIDs=items[1:20])
estimatePreference(recommendationengine, userID=users[10], itemIDs=items)
mostPopularItems(recommendationengine, howMany=10L)
recommend(recommendationengine, userID=users[5], howMany=10L)


Popularity bigdata / large data packages in R and ffbase useR presentation

A few weeks ago, RStudio released its download logs, showing who downloaded R packages through their CRAN mirror. More info: http://blog.rstudio.org/2013/06/10/rstudio-cran-mirror/

This is very nice information and it can be used to show the popularity of R packages. This has been done before, and also criticized, as the RStudio logs may or may not be representative of the download behaviour of all useRs.
As the useR2013 conference has come to an end, one of the topics corporate useRs of R seem to be talking about is how to speed up R and how R handles large data.
Edwin & BNOSAC did their fair share by giving a presentation about the use of ffbase alongside the ff package, which can be found here.
Looking at twitter feeds (https://twitter.com/search?q=user2013), there is now Tibco, which has its own R interpreter; there is R inside the JVM, Rcpp, Revolution R, ff/ffbase, R inside Oracle, pbdR, pretty quick R (pqR), MPI, R on grids, R with mongo/MonetDB, PL/R and dplyr; and useRs gave a lot of presentations about how they handled large data in their business setting. It seems the use of R with large datasets is becoming more and more accepted in the corporate world, which is a good thing. And we love the diversity!
For R packages which are on CRAN, the RStudio download logs can be used to show download statistics of the open source bigdata / large data packages which are now on the market (CRAN).
For this, the logs were downloaded and a number of open source packages which are out-of-memory / bigdata solutions in R were compared with respect to download stats on this mirror.
By far the most popular package appears to be ff, and our own contribution (ffbase) is not doing badly at all (+/- 100 IP addresses per week downloaded our package from the RStudio CRAN mirror alone).
If you are interested in the code to download the data and get the plot or if you want to compare your own packages, you can use the following code.
## Rstudio logs
library(ffbase)      # also loads ff
library(data.table)
library(ggplot2)
input <- list()
input$path <- getwd()   # or a directory of your choice, e.g. "/home/janw/Desktop/ffbaseusage"
input$start <- as.Date('2012-10-01')
input$today <- as.Date('2013-06-10')   # or Sys.Date()-1 to get all logs up to yesterday
input$all_days <- seq(input$start, input$today, by = 'day')
input$urls <- paste0('http://cran-logs.rstudio.com/', 
                     as.POSIXlt(input$all_days)$year + 1900, '/', input$all_days, '.csv.gz')
## Download
sapply(input$urls, FUN=function(x, path) {
  try(download.file(x, destfile = file.path(path, strsplit(x, "/")[[1]][[5]])))
}, path=input$path)

## Import the csv files and put them in 1 ffdf
files <- sort(list.files(input$path, pattern = "\\.csv\\.gz$"))
rstudiologs <- NULL
for(file in files){
  con <- gzfile(file.path(input$path, file))
  x <- read.csv(con, header=TRUE, colClasses = c("Date","character","integer", rep("factor", 6), "numeric"))
  x$time <- as.POSIXct(strptime(sprintf("%s %s", x$date, x$time), "%Y-%m-%d %H:%M:%S"))
  rstudiologs <- rbind(rstudiologs, as.ffdf(x))
}
rstudiologs <- subset(rstudiologs, as.Date(time) >= as.Date("2012-12-31"))
ffsave(rstudiologs, file = file.path(input$path, "rstudiologs"))

tmp <- ffload(file.path(input$path, "rstudiologs"), rootpath = tempdir())
rstudiologs[1:2, ]
packages <- c("ff","ffbase","bigmemory","mmap","filehash","pbdBASE","colbycol","MonetDB.R")
idx <- rstudiologs$package %in% ff(factor(packages))
idx <- ffwhich(idx, idx == TRUE)
mypackages <- rstudiologs[idx, ]
mypackages <- as.data.frame(mypackages)
info <- c("r_version","r_arch","r_os","package","version","country")
mypackages[info] <- apply(mypackages[info], MARGIN=2, as.character)
mypackages <- as.data.table(mypackages)
mypackages$aantal <- 1
## Map each date to the Monday of its week (Sunday belongs to the preceding Monday)
mondayofweek <- function(x){
    weekday <- as.integer(format(x, "%w"))
    as.Date(ifelse(weekday == 0, x-6, x-(weekday-1)), origin=Sys.Date()-as.integer(Sys.Date()))
}
mypackages$date <- mondayofweek(mypackages$date)
byday <- mypackages[, 
                    list(aantal = sum(aantal), 
                         ips = length(unique(ip_id))), 
                    by = list(package, date)]
byday <- subset(byday, date != max(as.character(byday$date)))

byday <- transform(byday, package=reorder(package, byday$ips))
qplot( data=byday, y=ips, x=date, color=reorder(package, -ips, mean), geom="line", size=I(1)
) + labs(x="", y="# unique ip", title="Rstudio logs 2013, downloads/week", color="") + theme_bw()
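
If you only want to see the weekly aggregation step without ff or data.table, the following is a small base R sketch on a made-up log. The column names date, package and ip_id follow the code above; the rows themselves and the helper monday_of_week are invented for this illustration.

```r
## Toy download log with made-up rows
logs <- data.frame(
  date    = as.Date(c("2013-06-03", "2013-06-05", "2013-06-05", "2013-06-09", "2013-06-10")),
  package = c("ff", "ff", "ffbase", "ff", "ff"),
  ip_id   = c(1, 1, 2, 3, 1)
)

## Monday of the week: %u is the ISO weekday, 1 = Monday ... 7 = Sunday
monday_of_week <- function(x) x - (as.integer(format(x, "%u")) - 1)
logs$week <- monday_of_week(logs$date)

## Number of distinct ip_id values per package and week
weekly <- aggregate(ip_id ~ package + week, data = logs,
                    FUN = function(ids) length(unique(ids)))
```

Counting distinct ip_id values instead of rows avoids counting the same machine several times within one week. Keep in mind that the ip_id in the logs is only meaningful within a single day, so this weekly number is an approximation.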