Popularity of bigdata / large data packages in R and the ffbase useR presentation

A few weeks ago, RStudio released its download logs, showing who downloaded R packages through their CRAN mirror. More info: http://blog.rstudio.org/2013/06/10/rstudio-cran-mirror/

This is very nice information and it can be used to show the popularity of R packages. That has been done before, and also criticized, as the RStudio logs may or may not be representative of the download behaviour of all useRs.
As the useR2013 conference has come to an end, one of the topics corporate useRs seem to be talking about is how to speed up R and how R handles large data.
Edwin & BNOSAC did their fair share by giving a presentation about the use of ffbase alongside the ff package, which can be found here.
Looking at the twitter feeds (https://twitter.com/search?q=user2013), there is now Tibco with its own R interpreter, R inside the JVM, Rcpp, Revolution R, ff/ffbase, R inside Oracle, pbdR, pretty quick R (pqR), MPI, R on grids, R with mongo/monet-DB, PL/R and dplyr, and useRs gave a lot of presentations about how they handled large data in their business setting. It seems like the use of R with large datasets is being more and more accepted in the corporate world, which is a good thing. And we love the diversity!
For R packages which are on CRAN, the RStudio download logs can be used to show download statistics of the open source bigdata / large data packages which are now on the market (CRAN).
For this, the logs were downloaded and a number of open source packages which offer out-of-memory / bigdata solutions in R were compared with respect to download stats on this mirror.
By far the most popular package seems to be ff, and our own contribution (ffbase) is not doing badly at all (+/- 100 ip addresses downloaded our package per week from the RStudio CRAN mirror alone).
If you are interested in the code to download the data and get the plot, or if you want to compare your own packages, you can use the following code.
## Rstudio logs
library(ffbase)      ## also loads ff (ffdf, ffsave, ffload)
library(data.table)
library(ggplot2)     ## for qplot
input <- list()
input$path <- getwd() ## or a folder of your choice, e.g. "/home/janw/Desktop/ffbaseusage"
input$start <- as.Date('2012-10-01')
input$today <- Sys.Date()-1 ## or a fixed end date, e.g. as.Date('2013-06-10')
input$all_days <- seq(input$start, input$today, by = 'day')
input$urls <- paste0('http://cran-logs.rstudio.com/', 
                     as.POSIXlt(input$all_days)$year + 1900, '/', input$all_days, '.csv.gz')
## Download
sapply(input$urls, FUN=function(x, path) {
  try(download.file(x, destfile = file.path(path, strsplit(x, "/")[[1]][[5]])))
}, path=input$path)

## Import the csv files and put them in 1 ffdf
files <- sort(list.files(input$path, pattern = ".csv.gz$"))
rstudiologs <- NULL
for(file in files){
  con <- gzfile(file.path(input$path, file))
  x <- read.csv(con, header=TRUE, colClasses = c("Date","character","integer", rep("factor", 6), "numeric"))
  x$time <- as.POSIXct(strptime(sprintf("%s %s", x$date, x$time), "%Y-%m-%d %H:%M:%S"))
  rstudiologs <- rbind(rstudiologs, as.ffdf(x))
}
rstudiologs <- subset(rstudiologs, as.Date(time) >= as.Date("2012-12-31"))
ffsave(rstudiologs, file = file.path(input$path, "rstudiologs"))

tmp <- ffload(file.path(input$path, "rstudiologs"), rootpath = tempdir())
rstudiologs[1:2, ]
packages <- c("ff","ffbase","bigmemory","mmap","filehash","pbdBASE","colbycol","MonetDB.R")
idx <- rstudiologs$package %in% ff(factor(packages))
idx <- ffwhich(idx, idx == TRUE)
mypackages <- rstudiologs[idx, ]
mypackages <- as.data.frame(mypackages)
info <- c("r_version","r_arch","r_os","package","version","country")
mypackages[info] <- apply(mypackages[info], MARGIN=2, as.character)
mypackages <- as.data.table(mypackages)
mypackages$aantal <- 1 ## aantal is Dutch for "count": each row is 1 download
mondayofweek <- function(x){
    ## %w gives the day of the week: 0 = Sunday, 1 = Monday, ..., 6 = Saturday
    weekday <- as.integer(format(x, "%w"))
    as.Date(ifelse(weekday == 0, x-6, x-(weekday-1)), origin=Sys.Date()-as.integer(Sys.Date()))
}
mypackages$date <- mondayofweek(mypackages$date)
byday <- mypackages[, 
                    list(aantal = sum(aantal), 
                         ips = length(unique(ip_id))), 
                    by = list(package, date)]
byday <- subset(byday, date != max(as.character(byday$date)))

byday <- transform(byday, package=reorder(package, byday$ips))
qplot( data=byday, y=ips, x=date, color=reorder(package, -ips, mean), geom="line", size=I(1)
) + labs(x="", y="# unique ip", title="Rstudio logs 2013, downloads/week", color="") + theme_bw()
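
To see what the weekly aggregation does without downloading any logs, here is a minimal base R sketch on a made-up log of five downloads. The data and the names logs, monday_of_week and weekly are assumptions for illustration only. It counts the unique ip_id values per package and per week, which is the same quantity the plot shows.

```r
# Toy download log: one row per download (made-up data, not real CRAN logs)
logs <- data.frame(
  date    = as.Date(c("2013-06-03", "2013-06-04", "2013-06-09",
                      "2013-06-10", "2013-06-10")),
  package = c("ff", "ff", "ffbase", "ff", "ff"),
  ip_id   = c(1L, 1L, 2L, 3L, 4L)
)

# Map each date to the Monday that starts its week
# (a Sunday belongs to the week of the previous Monday)
monday_of_week <- function(x) {
  weekday <- as.integer(format(x, "%w"))  # 0 = Sunday, 1 = Monday, ..., 6 = Saturday
  x - ifelse(weekday == 0, 6, weekday - 1)
}
logs$week <- monday_of_week(logs$date)

# Count the distinct ip_id values per package and week
weekly <- aggregate(ip_id ~ package + week, data = logs,
                    FUN = function(ids) length(unique(ids)))
names(weekly)[names(weekly) == "ip_id"] <- "ips"
weekly
```

Note that the same ip downloading ff twice in one week is counted only once, which is why the plot reports unique ip addresses rather than raw downloads.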


Massive online data stream mining with R

A few weeks ago, the stream package was released on CRAN. It allows you to do real time analytics on data streams. This can be very useful if you are working with large datasets which are already hard to fit in RAM completely, let alone to build a statistical model on them without running into RAM problems.

Most of the standard statistical algorithms require access to all data points and make several passes over the data, which makes them less suited for usage in R on big datasets.

Streaming algorithms, on the other hand, are characterised by
  1. making a single pass over the data,
  2. using a limited amount of storage space and RAM,
  3. working in a limited amount of time,
  4. being ready to use the model at any time.
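
As a minimal illustration of these four properties (it is not taken from the stream package), the sketch below keeps a running mean and variance with Welford's update. The helper names new_running_stats, update_stats and current_variance are hypothetical.

```r
# The state holds only three numbers, however many points have been seen
new_running_stats <- function() list(n = 0, mean = 0, m2 = 0)

# Welford's update: fold one new observation into the state
update_stats <- function(state, value) {
  state$n    <- state$n + 1
  delta      <- value - state$mean
  state$mean <- state$mean + delta / state$n
  state$m2   <- state$m2 + delta * (value - state$mean)
  state
}

# The estimate can be read at any time, not only at the end of the stream
current_variance <- function(state) {
  if (state$n < 2) NA_real_ else state$m2 / (state$n - 1)
}

state <- new_running_stats()
for (value in iris$Sepal.Length) {  # each point is visited exactly once
  state <- update_stats(state, value)
}

c(mean = state$mean, variance = current_variance(state))
```

The result matches mean() and var() on the full vector, but the loop never needs more than the current point and the three numbers in the state.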


The stream package is currently focussed on clustering algorithms available in MOA (http://moa.cms.waikato.ac.nz/details/stream-clustering/) and also eases interfacing with some clustering algorithms already available in R which are suited for data stream clustering. Classification algorithms based on MOA are on the todo list.
Currently available clustering algorithms are BIRCH, CluStream, ClusTree, DBSCAN, DenStream, Hierarchical, Kmeans and Threshold Nearest Neighbor.
The stream package allows you to easily use the models with different data sources. These can be SQL sources, Hadoop, Storm, Hive, simple csv files, flat files or other connections, and it is quite easy to extend it towards other connections. As an example, the code available at this gist (https://gist.github.com/jwijffels/5239198) makes it possible to connect to an ffdf from the ff package, so that you can do clustering on ff objects.
Below, you can find a toy example showing streaming clustering in R based on data in an ffdf.
  • Load the packages & the Data Stream Data for ffdf objects

  • Set up a data stream
myffdf <- as.ffdf(iris)
myffdf <- myffdf[c("Sepal.Length","Sepal.Width","Petal.Length","Petal.Width")]
mydatastream <- DSD_FFDFstream(x = myffdf, k = 100, loop=TRUE) 
  • Build the streaming clustering model
#### Get some points from the data stream
get_points(mydatastream, n=5)

#### Cluster (first part)
myclusteringmodel <- DSC_CluStream(k = 100)
cluster(myclusteringmodel, mydatastream, 1000)

#### Cluster (second part)
kmeans <- DSC_Kmeans(3)
recluster(kmeans, myclusteringmodel)
plot(kmeans, mydatastream, n = 150, main = "Streaming model - with 3 clusters")

This is the standard 2-step approach, which combines streaming micro-clustering with macro-clustering using a basic kmeans algorithm.
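
To make the 2-step idea concrete without the stream package, here is a rough base R sketch. It is not the CluStream algorithm: the micro step is a simple threshold rule (a point joins the nearest micro-cluster if it lies within an assumed radius, otherwise it starts a new one), and the macro step runs kmeans on the micro-cluster centres. The radius of 0.5 and all variable names are assumptions for illustration.

```r
set.seed(42)                       # kmeans starts from random centres
points <- as.matrix(iris[, 1:4])
radius <- 0.5                      # assumed merge threshold

# Step 1 (online): one pass, each point updates or creates a micro-cluster
centers <- matrix(numeric(0), nrow = 0, ncol = ncol(points))
weights <- numeric(0)
for (i in seq_len(nrow(points))) {
  p <- points[i, ]
  if (nrow(centers) > 0) {
    # Euclidean distance from p to every micro-cluster centre
    p_rows  <- matrix(p, nrow(centers), length(p), byrow = TRUE)
    d       <- sqrt(rowSums((centers - p_rows)^2))
    nearest <- which.min(d)
  }
  if (nrow(centers) > 0 && d[nearest] <= radius) {
    # Move the centre towards the point: running mean of its members
    weights[nearest]   <- weights[nearest] + 1
    centers[nearest, ] <- centers[nearest, ] + (p - centers[nearest, ]) / weights[nearest]
  } else {
    centers <- rbind(centers, p)
    weights <- c(weights, 1)
  }
}

# Step 2 (offline): cluster the micro-cluster centres, not the raw points
macro <- kmeans(centers, centers = 3)
nrow(centers)    # number of micro-clusters kept in memory
macro$centers    # the 3 macro-cluster centres
```

The point of the design is that only the micro-cluster centres and their weights are kept in memory, so the expensive macro step works on a small summary instead of the full stream.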
If you need help in understanding how your data can help you, if you need training and support on the efficient use of R, let us know how we can help you out.


bigglm on your big data set in open source R, it just works - similar to SAS

In a recent post by Revolution Analytics (link & link), in which Revolution benchmarked their closed source generalized linear model approach against SAS, Hadoop and open source R, they seemed to point out that no 'easy' open source R solution exists for building a poisson regression model on large datasets.

This post is about showing that fitting a generalized linear model to large data is easy in open source R and just works.
For this we recently included bigglm.ffdf in package ffbase to integrate it more closely with package biglm. That was pretty easy, as the help of the chunk function in package ff already shows how to do it and the code in the biglm package is readily available, so only some simple code modifications were needed.
Let's show how it works on some readily available data, which you can find here.
The following code shows some features (laf_open_csv, read.csv.ffdf, table.ff, binned_sum.ff, bigglm.ffdf, expand.ffgrid and merge.ffdf) of package ffbase and package ff which can be used in a standard setting where you have your large data, want to profile it, see some bivariate statistics and build a simple regression model to predict or understand your target.
It imports a flat file in an ffdf, shows some univariate statistics, does a fast group by and builds a linear regression model.
All without RAM problems, as the data is in ff.
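
Why can a regression be fitted on chunks at all? The base R sketch below shows the idea for ordinary least squares: the cross-products X'X and X'y are accumulated chunk by chunk, and the normal equations are solved once at the end. This is only an illustration under assumptions (iris as toy data, 3 chunks of 50 rows); biglm itself uses a more numerically careful incremental decomposition, which this sketch does not reproduce.

```r
# Toy data standing in for a table too large for RAM (assumption: 3 chunks of 50 rows)
data   <- iris[, c("Sepal.Length", "Petal.Length", "Petal.Width")]
chunks <- split(data, rep(1:3, each = 50))

# Accumulate X'X and X'y one chunk at a time;
# only these small matrices stay in memory between chunks
xtx <- matrix(0, 3, 3)
xty <- matrix(0, 3, 1)
for (chunk in chunks) {
  X   <- model.matrix(Sepal.Length ~ Petal.Length + Petal.Width, data = chunk)
  y   <- chunk$Sepal.Length
  xtx <- xtx + crossprod(X)
  xty <- xty + crossprod(X, y)
}

# Solve the normal equations once all chunks have been seen
coef_chunked <- drop(solve(xtx, xty))
coef_full    <- coef(lm(Sepal.Length ~ Petal.Length + Petal.Width, data = data))
rbind(chunked = coef_chunked, full = coef_full)
```

Both rows of the result agree, although the chunked version never holds more than 50 rows of the design matrix at once.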
  • Download the data

download.file("http://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/BSAPUFS/Downloads/2010_Carrier_PUF.zip", "2010_Carrier_PUF.zip")


  • Import it (showing 2 options - either by using package LaF or with read.csv.ffdf using argument transFUN to recode the input data according to the codebook which you can find here)
library(LaF)
library(ffbase)
## the LaF package is great if you are working with fixed-width format files but equally good for csv files and laf_to_ffdf does what it
## has to do: get the data in an ffdf
dat <- laf_open_csv(filename = "2010_BSA_Carrier_PUF.csv",
                    column_types = c("integer", "integer", "categorical", "categorical", "categorical", "integer",
                                     "integer", "categorical", "integer", "integer", "integer"),
                    column_names = c("sex", "age", "diagnose", "healthcare.procedure", "typeofservice", "service.count",
                                     "provider.type", "servicesprocessed", "place.served", "payment", "carrierline.count"),
                    skip = 1)
x <- laf_to_ffdf(laf = dat)
## the transFUN is easy to use if you want to transform your input data before putting it into the ffdf,
## it applies a function to your input data, which is read in chunks
## We use it here to recode the numbers to factors according to the codebook
x <- read.csv.ffdf(file = "2010_BSA_Carrier_PUF.csv",
                   colClasses = c("integer","integer","factor","factor","factor","integer","integer","factor","integer","integer","integer"),
                   transFUN=function(x){
                     names(x) <- recoder(names(x),
                                         from = c("BENE_SEX_IDENT_CD", "BENE_AGE_CAT_CD", "CAR_LINE_ICD9_DGNS_CD", "CAR_LINE_HCPCS_CD",
                                                  "CAR_LINE_BETOS_CD", "CAR_LINE_SRVC_CNT", "CAR_LINE_PRVDR_TYPE_CD", "CAR_LINE_CMS_TYPE_SRVC_CD",
                                                  "CAR_LINE_PLACE_OF_SRVC_CD", "CAR_HCPS_PMT_AMT", "CAR_LINE_CNT"),
                                         to = c("sex", "age", "diagnose", "healthcare.procedure", "typeofservice", "service.count",
                                                "provider.type", "servicesprocessed", "place.served", "payment", "carrierline.count"))
                     x$sex <- factor(recoder(x$sex, from = c(1,2), to=c("Male","Female")))
                     ## 6 age categories, so 6 codes
                     x$age <- factor(recoder(x$age, from = c(1,2,3,4,5,6),
                                             to=c("Under 65", "65-69", "70-74", "75-79", "80-84", "85 and older")))
                     ## labels for codes 62 and 72 follow the CMS place of service code list
                     x$place.served <- factor(recoder(x$place.served,
                                                      from = c(0, 1, 11, 12, 21, 22, 23, 24, 31, 32, 33, 34, 41, 42,
                                                               50, 51, 52, 53, 54, 56, 60, 61, 62, 65, 71, 72, 81, 99),
                                                      to = c("Invalid Place of Service Code", "Office (pre 1992)", "Office", "Home",
                                                             "Inpatient hospital", "Outpatient hospital", "Emergency room - hospital",
                                                             "Ambulatory surgical center", "Skilled nursing facility", "Nursing facility",
                                                             "Custodial care facility", "Hospice", "Ambulance - land", "Ambulance - air or water",
                                                             "Federally qualified health centers", "Inpatient psychiatric facility",
                                                             "Psychiatric facility partial hospitalization", "Community mental health center",
                                                             "Intermediate care facility/mentally retarded", "Psychiatric residential treatment center",
                                                             "Mass immunizations center", "Comprehensive inpatient rehabilitation facility",
                                                             "Comprehensive outpatient rehabilitation facility",
                                                             "End stage renal disease treatment facility", "State or local public health clinic",
                                                             "Rural health clinic", "Independent laboratory", "Other unlisted facility")))
                     x
                   }, VERBOSE=TRUE)
class(x)
dim(x)
  • Profile your data
## Data Profiling using table.ff
barplot(table.ff(x$age), col = "lightblue")
barplot(table.ff(x$sex), col = "lightblue")
barplot(table.ff(x$typeofservice), col = "lightblue")

  • Grouping by - showing the speedy binned_sum
## Basic & fast group by with ff data
doby <- list()
doby$sex <- binned_sum.ff(x = x$payment, bin = x$sex, nbins = length(levels(x$sex)))
doby$age <- binned_sum.ff(x = x$payment, bin = x$age, nbins = length(levels(x$age)))
doby$place.served <- binned_sum.ff(x = x$payment, bin = x$place.served, nbins = length(levels(x$place.served)))
doby <- lapply(doby, FUN=function(x){
  x <- as.data.frame(x)
  x$mean <- x$sum / x$count
  x
})
doby$sex$sex <- recoder(rownames(doby$sex), from = rownames(doby$sex), to = levels(x$sex))
doby$age$age <- recoder(rownames(doby$age), from = rownames(doby$age), to = levels(x$age))
doby$place.served$place.served <- recoder(rownames(doby$place.served), from = rownames(doby$place.served), to = levels(x$place.served))
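
If you want to check on a small example what such a binned sum delivers (a count and a sum per group, from which the mean follows), here is the same computation in plain base R on an in-memory toy table. The data and the names payments and summary_by_sex are made up for illustration; this is not the ffbase implementation.

```r
# Toy payments table (made-up numbers)
payments <- data.frame(
  sex     = factor(c("Male", "Female", "Female", "Male", "Female")),
  payment = c(10, 20, 40, 30, 60)
)

# Per group: number of rows and sum of the payments
sums   <- tapply(payments$payment, payments$sex, sum)
counts <- tapply(payments$payment, payments$sex, length)

summary_by_sex <- data.frame(sex   = names(sums),
                             count = as.vector(counts),
                             sum   = as.vector(sums))
# The mean per group follows from the two accumulated numbers
summary_by_sex$mean <- summary_by_sex$sum / summary_by_sex$count
summary_by_sex
```

Because counts and sums can simply be added up across chunks, this kind of group by scales to data that does not fit in RAM.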

  • Build a generalized linear model using package biglm which integrates with ffbase::bigglm.ffdf
## Make a linear model using biglm
library(biglm)
mymodel <- bigglm(payment ~ sex + age + place.served, data = x)
# This will overflow your RAM as it will get your data from ff into RAM
#summary(glm(payment ~ sex + age + place.served, data = x[,c("payment","sex","age","place.served")]))

  • Do the same on more data: 280Mio records
## Ok, we were working only on +/- 2.8Mio records which is not big, let's explode the data by 100 to get 280Mio records 
x$id <- ffseq_len(nrow(x))
xexploded <- expand.ffgrid(x$id, ff(1:100)) # Had to wait 3 minutes on my computer
colnames(xexploded) <- c("id","explosion.nr")
xexploded <- merge(xexploded, x, by.x="id", by.y="id", all.x=TRUE, all.y=FALSE) ## this uses merge.ffdf, might take 30 minutes
dim(xexploded) ## hopsa, 280 Mio records and 13.5Gb created
sum(.rambytes[vmode(xexploded)]) * (nrow(xexploded) * 9.31322575 * 10^(-10))
## And build the linear model again on the whole dataset
mymodel <- bigglm(payment ~ sex + age + place.served, data = xexploded)
Hmm, it looks like people who were helped by an ambulance at sea or by an air ambulance had to pay more.
  • That wasn't that hard, was it? Now it's your turn.

RBelgium meeting on November 16

Next week on Friday, November 16, the RBelgium R user group is holding its next regular meeting in Brussels.

This is the schedule of the upcoming RBelgium Regular meeting:

* Graphical User Interface developments around R, including tcltk2 and SciViews - Philippe Grosjean (UMons)
* Using R via the Amazon Cloud - Jean-Baptiste Poullet (stat'Rgy)
* Literature review: R books - Brecht Devleesschauwer (UGent, UCL)

The meeting will take place on Friday 16 November, at 18h45, at the ULB Campus de la Plaine. Everyone is welcome to join!

R courses in Belgium

Every year, the Leuven Statistics Research Center (Belgium) offers short courses for professionals and researchers in statistics and statistical tools.

The following link shows the overview of the courses: http://lstat.kuleuven.be/consulting/shortcourses/ENcourse%20overview.htm or get it here in pdf: http://lstat.kuleuven.be/consulting/shortcourses/BRO_LSTAT_2012-2013.pdf

This year, BNOSAC is presenting the course on Advanced R Programming Topics, which will be held on October 18-19.

This course is a hands-on course covering the basic toolkit you need to have in order to use R efficiently for data analysis tasks. It is an intermediate course aimed at users who have the knowledge from the course 'Essential tools for R' and who want to go further to improve and speed up their data analysis tasks.

The following topics will be covered in detail:

  • The apply family of functions and basic parallel programming for these, vectorisation, regular expressions, string manipulation functions and commonly used functions from the base package. Useful other packages for data manipulation.
  • Making a basic reproducible report using Sweave and knitr including tables, graphs and literate programming
  • If you want to build your own R package to distribute your work, you need to understand S3 and S4 methods, the basics of how generics work, R environments, and what namespaces are and why they are useful. This will be covered to help you get started with building an R package.
  • Basic tips on how to organise and develop R code and test it.

You can subscribe here: