Massive online data stream mining with R
A few weeks ago, the stream package has been released on CRAN. It allows to do real time analytics on data streams. This can be very usefull if you are working with large datasets which are already hard to put in RAM completely, let alone to build some statistical model on it without getting into RAM problems.
Most of the standard statistical algorithms require access to all data points and make several iterations over the data and are less suited for usage in R on big datasets.
- single passes over the data,
- using a limited amount of storage space and RAM
- work in a limited amount of time
- be ready to use the model at any time
- Load the packages & the Data Stream Data for ffdf objects
require(devtools) require(stream) require(ff) source_gist("5239198")
- Set up a data stream
myffdf <- as.ffdf(iris) myffdf <- myffdf[c("Sepal.Length","Sepal.Width","Petal.Length","Petal.Width")] mydatastream <- DSD_FFDFstream(x = myffdf, k = 100, loop=TRUE) mydatastream
- Build the streaming clustering model
#### Get some points from the data stream get_points(mydatastream, n=5) mydatastream #### Cluster (first part) myclusteringmodel <- DSC_CluStream(k = 100) cluster(myclusteringmodel, mydatastream, 1000) myclusteringmodel plot(myclusteringmodel) #### Cluster (second part) kmeans <- DSC_Kmeans(3) recluster(kmeans, myclusteringmodel) plot(kmeans, mydatastream, n = 150, main = "Streaming model - with 3 clusters")
If you need help in understanding how your data can help you, if you need training and support on the efficient use of R, let us know how we can help you out.