CRAN mirrors are the backbone of everyday R usage. They serve the R website and most of the R packages in use today. There are currently about 104 official CRAN mirrors. Hosting a CRAN mirror is one way to help the R community, and how to do so is explained here.
To ease that process, at BNOSAC, we have created a Docker image which sets up a CRAN mirror.
The Docker image is available for download from the Docker registry: https://registry.hub.docker.com/u/bnosac/cran-mirror
For people who don't know Docker: it is a tool which allows developers to containerise an application. In this case, the application is a CRAN mirror.
How does it work? 3 steps:
1. Install Docker on your computer or server as explained here, if you haven't done this already.
2. Pull the docker image: docker pull bnosac/cran-mirror
3. Run the CRAN mirror: docker run -p 22:22 -p 80:80 -v /home/bnosac/CRAN/:/var/www/html -d bnosac/cran-mirror
That's it! The container starts syncing CRAN to /home/bnosac/CRAN and will resync every day at 02h30 UTC. You can now go to 0.0.0.0 in your browser, or find the IP address of the machine where the container is running and go to that address, to see the R website (see https://registry.hub.docker.com/u/bnosac/cran-mirror for more info).
Now what can you do with it?
- have a local CRAN mirror in your company
- install.packages("data.table", repos = "mylocalmirror")
- serve the community with another mirror if no nearby mirror is readily available
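From the R side, pointing a session at the in-house mirror is a one-liner. A minimal sketch (the hostname cran.mycompany.local is a placeholder; substitute the address of your own mirror server):

```r
## Point this R session at the in-house CRAN mirror
## (cran.mycompany.local is a hypothetical hostname -- use your own server's address)
local_mirror <- "http://cran.mycompany.local"
options(repos = c(CRAN = local_mirror))

## Subsequent installs now fetch from the local mirror, e.g.:
## install.packages("data.table")
getOption("repos")
```

Putting the options() call in your site-wide Rprofile makes the mirror the default for all users.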
This year, BNOSAC offers 2 R courses in cooperation with the Leuven Statistics Research Center. The courses are part of the Leuven STATistics STATe of the Art Training Initiative and are given in Leuven (Belgium).
For R users and data scientists we offer 2 short courses on R programming & statistical learning, namely:
* Advanced R programming topics (November 3-4, 2014)
* Statistical Machine Learning with R (November 27-28, 2014)
You can download the brochure with the courses here: http://lstat.kuleuven.be/training/HR_BRO_LSTAT_2014-2015.pdf
Interested? Registration can be done here.
Last week, we released the RMOA package on CRAN (http://cran.r-project.org/web/packages/RMOA). It is an R package for building streaming classification and regression models on top of MOA.
MOA is the acronym of 'Massive Online Analysis' and is the most popular open source framework for data stream mining, developed at the University of Waikato: http://moa.cms.waikato.ac.nz. Our RMOA package interfaces with MOA version 2014.04 and focuses on building, evaluating and scoring streaming classification & regression models on data streams.
The classification & regression models available through RMOA are:
- Classification trees:
* AdaHoeffdingOptionTree
* ASHoeffdingTree
* DecisionStump
* HoeffdingAdaptiveTree
* HoeffdingOptionTree
* HoeffdingTree
* LimAttHoeffdingTree
* RandomHoeffdingTree
- Bayesian classification:
* NaiveBayes
* NaiveBayesMultinomial
- Active learning classification:
* ActiveClassifier
- Ensemble (meta) classifiers:
* Bagging
+ LeveragingBag
+ OzaBag
+ OzaBagAdwin
+ OzaBagASHT
* Boosting
+ OCBoost
+ OzaBoost
+ OzaBoostAdwin
* Stacking
+ LimAttClassifier
* Other
+ AccuracyUpdatedEnsemble
+ AccuracyWeightedEnsemble
+ ADACC
+ DACC
+ OnlineAccuracyUpdatedEnsemble
+ TemporallyAugmentedClassifier
+ WeightedMajorityAlgorithm
- Regression modelling:
* AMRulesRegressor
* FadingTargetMean
* FIMTDD
* ORTO
* Perceptron
* RandomRules
* SGD (Stochastic Gradient Descent)
* TargetMean
Interfaces are implemented to model data in standard files (csv, txt, delimited), ffdf data (from the ff package), data.frames and matrices.
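As an indication of how these interfaces look, here is a sketch of setting up an in-memory and a file-based stream (this assumes RMOA and its Java dependencies are installed; datastream_dataframe and datastream_csv are the package's constructors for data.frame and csv sources):

```r
library(RMOA)

## In-memory stream from a data.frame
data(iris)
stream_df <- datastream_dataframe(data = iris)

## File-based stream from a csv file on disk
tmp <- tempfile(fileext = ".csv")
write.csv(iris, tmp, row.names = FALSE)
stream_csv <- datastream_csv(file = tmp)
```

Both objects can be passed as the data argument of trainMOA, which consumes them chunk by chunk instead of loading everything in RAM at once.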
Documentation of MOA directed towards RMOA users can be found at http://jwijffels.github.io/RMOA
Examples on the use of RMOA can be found in the documentation, on github at https://github.com/jwijffels/RMOA or e.g. by viewing the showcase at http://bnosac.be/index.php/blog/16-rmoa-massive-online-data-stream-classifications-with-r-a-moa
If you need support on building streaming models on top of your large dataset, get in contact.
For those of you who don't know MOA: MOA stands for Massive On-line Analysis and is an open-source framework for building and running machine learning and data mining experiments on evolving data streams. The website of MOA (http://moa.cms.waikato.ac.nz) indicates it contains machine learning algorithms for classification, regression, clustering, outlier detection and recommendation engines.
For R users who work with a lot of data or encounter RAM issues when building models on large datasets, MOA, and data streams in general, have some nice features:
- It uses a limited amount of memory, so no RAM issues when building models.
- It processes one example at a time, and passes over each example only once.
- It works incrementally, so a model is ready to be used for prediction at any point.
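The incremental idea can be illustrated in a few lines of plain R. This is not RMOA code, just a sketch of the single-pass, constant-memory principle that MOA's algorithms build on:

```r
## Single-pass, constant-memory mean: each example is seen exactly once,
## and the estimate is usable at any point during the stream
online_mean <- function(stream) {
  m <- 0
  n <- 0
  for (xi in stream) {
    n <- n + 1
    m <- m + (xi - m) / n   # incremental update, O(1) memory
  }
  m
}
online_mean(c(2, 4, 6, 8))   # identical to mean(c(2, 4, 6, 8))
```

MOA's streaming classifiers apply the same pattern to model parameters instead of a single statistic.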
Unfortunately MOA is written in Java and not easily accessible to R users. For users mostly interested in clustering, the stream package already facilitates this (this blog item gave an example of using ff alongside the stream package). In our day-to-day use cases, however, classification is a more common request, and the stream package only covers clustering. Hence the decision to make the classification algorithms of MOA easily available to R users as well. For this the RMOA package was created; it is available on github (https://github.com/jwijffels/RMOA).
The current features of RMOA are:
- Easy to set up data streams on data in RAM (data.frame/matrix), data in files (csv, delimited, flat table) as well as out-of-memory data in an ffdf (ff package).
- Easy to set up a MOA classification model
- There are 26 classification models available:
  * Classification trees (AdaHoeffdingOptionTree, ASHoeffdingTree, DecisionStump, HoeffdingAdaptiveTree, HoeffdingOptionTree, HoeffdingTree, LimAttHoeffdingTree, RandomHoeffdingTree)
  * Bayes rule (NaiveBayes, NaiveBayesMultinomial)
  * Ensemble learning
    + Bagging (LeveragingBag, OzaBag, OzaBagAdwin, OzaBagASHT)
    + Boosting (OCBoost, OzaBoost, OzaBoostAdwin)
    + Stacking (LimAttClassifier)
    + Other (AccuracyUpdatedEnsemble, AccuracyWeightedEnsemble, ADACC, DACC, OnlineAccuracyUpdatedEnsemble, TemporallyAugmentedClassifier, WeightedMajorityAlgorithm)
  * Active learning (ActiveClassifier)
- An easy, R-familiar formula interface to train the model on streaming data:
trainMOA(model, formula, data, subset, na.action = na.exclude, ...)
- Easy prediction on new data alongside the model:
predict(object, newdata, type = "response", ...)
An example of R code which constructs a HoeffdingTree and a boosted set of HoeffdingTrees is shown below.
##
## Installation from github
## (rJava requires a Java runtime to be installed on your machine)
##
library(devtools)
install.packages("ff")
install.packages("rJava")
install_github("jwijffels/RMOA", subdir = "RMOAjars/pkg")
install_github("jwijffels/RMOA", subdir = "RMOA/pkg")
##
## HoeffdingTree example
##
require(RMOA)
hdt <- HoeffdingTree(numericEstimator = "GaussianNumericAttributeClassObserver")
hdt
## Define a stream - e.g. a stream based on a data.frame
data(iris)
iris <- factorise(iris)
irisdatastream <- datastream_dataframe(data=iris)
## Train the HoeffdingTree on the iris dataset
mymodel <- trainMOA(model = hdt,
                    formula = Species ~ Sepal.Length + Sepal.Width + Petal.Length,
                    data = irisdatastream)
## Predict using the HoeffdingTree on the iris dataset
scores <- predict(mymodel, newdata=iris, type="response")
table(scores, iris$Species)
scores <- predict(mymodel, newdata=iris, type="votes")
head(scores)
##
## Boosted set of HoeffdingTrees
##
irisdatastream <- datastream_dataframe(data=iris)
mymodel <- OzaBoost(baseLearner = "trees.HoeffdingTree", ensembleSize = 30)
mymodel <- trainMOA(model = mymodel,
                    formula = Species ~ Sepal.Length + Sepal.Width + Petal.Length,
                    data = irisdatastream)
## Predict
scores <- predict(mymodel, newdata=iris, type="response")
table(scores, iris$Species)
scores <- predict(mymodel, newdata=iris, type="votes")
head(scores)