Just before the summer holidays, BNOSAC presented a talk called Computer Vision and Image Recognition algorithms for R users at the useR! conference. In the talk, six packages for computer vision with R were introduced to an audience of about 250 people. The R packages we covered, all developed by BNOSAC, are:
- image.CornerDetectionF9: FAST-9 corner detection
- image.CannyEdges: Canny Edge Detector
- image.LineSegmentDetector: Line Segment Detector (LSD)
- image.ContourDetector: Unsupervised Smooth Contour Line Detection
- image.dlib: Speeded up robust features (SURF) and histogram of oriented gradients (FHOG) features
- image.darknet: Image classification using darknet with the deep learning models AlexNet, Darknet, VGG-16, GoogleNet and Darknet19, as well as object detection using the state-of-the-art YOLO detection system
For those of you who missed it, you can still watch the video of the presentation and view the pdf of the presentation below. The packages are open source and available at https://github.com/bnosac/image
If you have a computer vision endeavour in mind, feel free to get in touch for a quick chat. For those of you interested in training on image analysis, you can always register for our training on Computer Vision with R and Python here. For more details on the full training program and the training dates provided by BNOSAC, visit http://bnosac.be/index.php/training
Part-of-speech (POS) tagging is a crucial part of natural language processing. It consists of labelling each word in a text document with a category such as noun, verb, adverb or pronoun. At BNOSAC, we use it on a daily basis to select only nouns before we do topic detection, or in specific NLP flows. For R users working with different languages, the number of POS tagging options is small and each has its upsides and downsides. The following taggers are commonly used.
- The Stanford Part-Of-Speech Tagger, which is terribly slow and whose language set is limited to English/French/German/Spanish/Arabic/Chinese (no Dutch). R packages for this are available at http://datacube.wu.ac.at.
- TreeTagger (http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger) covers more languages but may only be used for non-commercial purposes (it can be used through the koRpus R package).
- OpenNLP is faster and allows POS tagging for Dutch, Spanish, Polish, Swedish, English, Danish and German, but not for French or Eastern European languages. R packages for this are available at http://datacube.wu.ac.at.
- Package pattern.nlp (https://github.com/bnosac/pattern.nlp) allows POS tagging and lemmatisation for Dutch, French, English, German, Spanish and Italian, but needs Python installed, which is not always easy to request from IT departments.
- SyntaxNet and Parsey McParseface (https://github.com/tensorflow/models/tree/master/syntaxnet) have good accuracy for POS tagging but need tensorflow installed, which might be too much installation hassle in a corporate setting, not to mention the computational resources needed.
Enter RDRPOSTagger, which BNOSAC released at https://github.com/bnosac/RDRPOSTagger. It has the following features:
- Easily installable in a corporate environment as a simple R package based on rJava
- Covering more than 40 languages:
UniversalPOS annotation for languages: Ancient_Greek, Ancient_Greek-PROIEL, Arabic, Basque, Bulgarian, Catalan, Chinese, Croatian, Czech, Czech-CAC, Czech-CLTT, Danish, Dutch, Dutch-LassySmall, English, English-LinES, Estonian, Finnish, Finnish-FTB, French, Galician, German, Gothic, Greek, Hebrew, Hindi, Hungarian, Indonesian, Irish, Italian, Kazakh, Latin, Latin-ITTB, Latin-PROIEL, Latvian, Norwegian, Old_Church_Slavonic, Persian, Polish, Portuguese, Portuguese-BR, Romanian, Russian-SynTagRus, Slovenian, Slovenian-SST, Spanish, Spanish-AnCora, Swedish, Swedish-LinES, Tamil, Turkish. Prepend UD_ to the language name if you want to use these models.
MORPH annotation for languages: Bulgarian, Czech, Dutch, French, German, Portuguese, Spanish, Swedish
POS annotation for languages: English, French, German, Hindi, Italian, Thai, Vietnamese
- Fast tagging as the Single Classification Ripple Down Rules are easy to execute and hence are quick on larger text volumes
- Competitive accuracy in comparison to state-of-the-art POS and morphological taggers
- Cross-platform running on Windows/Linux/Mac
- It supports morphological tagging, POS tagging and universal POS tagging of sentences
The Ripple Down Rules are basic binary classification trees which are built on top of the Universal Dependencies datasets available at http://universaldependencies.org. The methodology is explained in detail in the paper 'A Robust Transformation-Based Learning Approach Using Ripple Down Rules for Part-Of-Speech Tagging', available at http://content.iospress.com/articles/ai-communications/aic698. If you just want to apply POS tagging to your text, you can go ahead as follows:
library(RDRPOSTagger)
## POS annotation
x <- c("Oleg Borisovich Kulik is a Ukrainian-born Russian performance artist")
tagger <- rdr_model(language = "English", annotation = "POS")
rdr_pos(tagger, x = x)
## MORPH/POS annotation
x <- c("Dus godvermehoeren met pus in alle puisten , zei die schele van Van Bukburg .",
"Er was toen dat liedje van tietenkonttieten kont tieten kontkontkont",
" ", "", NA)
tagger <- rdr_model(language = "Dutch", annotation = "MORPH")
rdr_pos(tagger, x = x)
## Universal POS tagging annotation
tagger <- rdr_model(language = "UD_Dutch", annotation = "UniversalPOS")
rdr_pos(tagger, x = x)
## This gives the following output
sentence.id word.id word word.type
1 1 Dus ADV
1 2 godvermehoeren VERB
1 3 met ADP
1 4 pus NOUN
1 5 in ADP
1 6 alle PRON
1 7 puisten NOUN
1 8 , PUNCT
1 9 zei VERB
1 10 die PRON
1 11 schele ADJ
1 12 van ADP
1 13 Van PROPN
1 14 Bukburg PROPN
1 15 . PUNCT
2 1 Er ADV
2 2 was AUX
2 3 toen SCONJ
2 4 dat SCONJ
2 5 liedje NOUN
2 6 van ADP
2 7 tietenkonttieten VERB
2 8 kont PROPN
2 9 tieten VERB
2 10 kontkontkont PROPN
2 11 . PUNCT
3 0 <NA> <NA>
4 0 <NA> <NA>
5 0 <NA> <NA>
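Selecting only the nouns, as mentioned at the start of this post, then comes down to a simple filter on the word.type column. The snippet below is a minimal sketch in base R. It assumes the tagger output is a data.frame with the columns shown above, and it rebuilds a few of those rows by hand so that it runs without the package installed.

```r
## A few rows copied from the output above, rebuilt by hand for illustration
tagged <- data.frame(
  sentence.id = c(1, 1, 1, 1, 2, 2),
  word.id     = c(3, 4, 7, 9, 5, 6),
  word        = c("met", "pus", "puisten", "zei", "liedje", "van"),
  word.type   = c("ADP", "NOUN", "NOUN", "VERB", "NOUN", "ADP"),
  stringsAsFactors = FALSE
)

## Keep only the nouns
nouns <- subset(tagged, word.type == "NOUN")

## Paste the nouns of each sentence back together, e.g. as input for topic detection
nouns_per_sentence <- vapply(split(nouns$word, nouns$sentence.id),
                             paste, character(1), collapse = " ")
print(nouns_per_sentence)
```

With real tagger output, the hand-built data.frame would simply be replaced by the result of rdr_pos.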
The function rdr_pos expects a vector of sentences as input. If you need to split your text data into sentences first, just use tokenize_sentences from the tokenizers package.
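As a rough sketch of that preprocessing step, the helper below (split_sentences is a hypothetical name, not part of any package) uses tokenizers::tokenize_sentences when the package is installed and otherwise falls back to a crude base R split on sentence-ending punctuation. The fallback is only an illustration and will get abbreviations and similar cases wrong.

```r
## Hypothetical helper: split free text into a character vector of sentences
split_sentences <- function(text) {
  if (requireNamespace("tokenizers", quietly = TRUE)) {
    ## tokenize_sentences returns a list with one element per input document
    return(unlist(tokenizers::tokenize_sentences(text)))
  }
  ## Crude fallback: split on whitespace that follows ., ! or ?
  parts <- unlist(strsplit(text, "(?<=[.!?])\\s+", perl = TRUE))
  parts[nzchar(trimws(parts))]
}

sentences <- split_sentences("Good luck with text mining. Let us know if you need help!")
print(sentences)
```

The resulting character vector can be passed as the x argument of rdr_pos.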
Good luck with text mining.
If you need our help with a text mining project, let us know. We'll be glad to get you started.
Two new R packages were put on CRAN last week by BNOSAC (www.bnosac.be).
- One package for scheduling R scripts and processes on Windows (taskscheduleR) and
- Another package for scheduling R scripts and processes on Unix / Linux (cronR)
These two packages allow you to schedule R processes directly from R. This is done by passing commands directly to cron, a basic Linux/Unix job scheduling utility, or by using the Windows Task Scheduler. The packages were developed for beginning R users who are unaware of the fact that R scripts can also be run non-interactively and can be automated.
We already blogged about the taskscheduleR R package in this blog post and also here. This time we go into more detail on the cronR R package.
The cronR package allows you to
- Get the list of scheduled jobs
- Remove scheduled jobs
- Add a job
- a job is basically a script with R code which is run through Rscript
- You can schedule tasks 'ONCE', 'EVERY MINUTE', 'EVERY HOUR', 'EVERY DAY', 'EVERY WEEK', 'EVERY MONTH' or any complex schedule
- The task log contains the stdout & stderr of the Rscript run at that point in time. This log can be found in the same folder as the R script
The package is especially suited for people working on an RStudio server in the cloud or within the premises of their corporate environment. It makes it easy to schedule processes. To make that extremely easy for beginning R users, an RStudio addin was developed, which is shown in the example below. The RStudio addin basically allows you to select an R script and schedule it at specific points in time. It does this by copying the script to your launch/log folder and setting up a cron job for that script.
The example below shows how to set up a cron job using the RStudio addin so that the scripts are launched every minute or every day at a specific hour. The R code is launched through Rscript, and the log will contain the errors and warnings in case your script failed, so that you can review where the code failed.
Note that you can also pass arguments to the R script, so that you can launch the same script for productXYZ and productABC.
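On the script side, such arguments can be picked up with commandArgs from base R. The sketch below assumes, purely for illustration, that the script receives a product name and a date in yyyymmdd format. The helper name parse_job_args and the argument layout are our own choices here, not something cronR prescribes.

```r
## Hypothetical helper for the top of a scheduled script:
## read the arguments that were passed on the Rscript command line
parse_job_args <- function(args = commandArgs(trailingOnly = TRUE)) {
  if (length(args) < 2) {
    stop("usage: Rscript script.R <product> <yyyymmdd>")
  }
  list(product = args[1],
       date    = as.Date(args[2], format = "%Y%m%d"))
}

## Inside a scheduled script you would call parse_job_args() without arguments.
## Here the arguments are passed explicitly so that the example runs interactively.
job <- parse_job_args(c("productx", "20160101"))
cat("Running for", job$product, "on", format(job$date), "\n")
```

Anything written with cat or message ends up in the job log, which makes it easy to see afterwards which product a given run was for.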
Of course scheduling scripts can also be done from R directly. Some examples are shown below. More information at https://github.com/bnosac/cronR
library(cronR)
f <- system.file(package = "cronR", "extdata", "helloworld.R")
cmd <- cron_rscript(f, rscript_args = c("productx", "20160101"))
## Every minute
cron_add(cmd, frequency = 'minutely', id = 'job1', description = 'Customers')
## Every hour at 20 past the hour on Monday and Tuesday
cron_add(cmd, frequency = 'hourly', id = 'job2', at = '00:20', description = 'Weather', days_of_week = c(1, 2))
## Every day at 14h20 on Sunday, Wednesday and Friday
cron_add(cmd, frequency = 'daily', id = 'job3', at = '14:20', days_of_week = c(0, 3, 5))
## Every starting day of the month at 10h30
cron_add(cmd, frequency = 'monthly', id = 'job4', at = '10:30', days_of_month = 'first', days_of_week = '*')
## Get all the jobs
cron_ls()
## Remove all scheduled jobs
cron_clear(ask = FALSE)
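For readers who are new to cron: each scheduled job ends up as a line in the crontab that starts with five time fields (minute, hour, day of month, month, day of week). The sketch below shows which standard crontab expressions correspond to the four schedules above. The helper crontab_fields is hypothetical and only illustrates the syntax. It is not how cronR builds its entries internally.

```r
## Hypothetical helper: build the five time fields of a crontab line
crontab_fields <- function(minute = "*", hour = "*", day_of_month = "*",
                           month = "*", day_of_week = "*") {
  ## Several values in one field are separated by commas, e.g. "1,2"
  field <- function(x) paste(x, collapse = ",")
  paste(field(minute), field(hour), field(day_of_month),
        field(month), field(day_of_week))
}

every_minute  <- crontab_fields()
hourly_mon_tu <- crontab_fields(minute = 20, day_of_week = c(1, 2))
daily_su_w_f  <- crontab_fields(minute = 20, hour = 14, day_of_week = c(0, 3, 5))
monthly_first <- crontab_fields(minute = 30, hour = 10, day_of_month = 1)

print(c(every_minute, hourly_mon_tu, daily_su_w_f, monthly_first))
```

In the day-of-week field, 0 stands for Sunday, which is why Sunday, Wednesday and Friday are written as 0, 3 and 5.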
We hope this will save you some precious time. If you need more help with automating R processes, feel free to get in touch. We have a special training course devoted to managing R processes which can be given in your organisation. More information can be found in our training curriculum.
As part of our data science training initiative, BNOSAC is also providing a course on computer vision with R & Python, held on March 9-10 in Leuven, Belgium (subscribe here or have a look at our full training offer here). Part of the course covers finding blobs, corners, gradients, edges & lines in images.
For this reason, the R package image.LineSegmentDetector was made available at https://github.com/bnosac/image. It detects line segments in digital images. An example of this is shown below.
library(image.LineSegmentDetector)
library(pixmap)
## Read in the image + make sure the input to the algorithm is a matrix with grey-scale values in the 0-255 range
imagelocation <- system.file("extdata", "chairs.pgm", package="image.LineSegmentDetector")
image <- read.pnm(file = imagelocation, cellres = 1)
x <- image@grey * 255
## Detect the line segments and plot them on top of the image
linesegments <- image_line_segment_detector(x)
plot(image)
plot(linesegments, add = TRUE, col = "red")
The line segment detector finds lines in digital grey-scale images and is an implementation of the linear-time Line Segment Detector explained at https://doi.org/10.5201/ipol.2012.gjmr-lsd. It gives subpixel-accurate results and is designed to work on any digital image without parameter tuning. It controls its own number of false detections: on average, one false alarm is allowed per image.
More information can be found at https://github.com/bnosac/image
The algorithm requires a grey-scale image as input. So if you have another type of image, you can use the excellent magick package to convert it to grey scale.
library(magick)
f <- tempfile(fileext = ".pgm")
x <- image_read("http://www.momentumshipping.net/lounge/wp-content/uploads/2015/10/containership2.jpg")
x <- image_convert(x, format = "pgm", depth = 8)
image_write(x, path = f, format = "pgm")
image <- read.pnm(file = f, cellres = 1)
linesegments <- image_line_segment_detector(image@grey * 255)
plot(image)
plot(linesegments, add = TRUE, col = "red")
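If your image is already available in R as an RGB array with values between 0 and 1, the grey-scale matrix can also be sketched in base R. The example below is only an illustration under that assumption. It uses the common 0.299/0.587/0.114 luma weights, which are our own choice here and not necessarily what magick applies, and the helper to_grey_255 is hypothetical.

```r
## Small synthetic RGB image: height 4, width 5, 3 colour channels, values in [0, 1]
rgb_img <- array(seq(0, 1, length.out = 4 * 5 * 3), dim = c(4, 5, 3))

## Hypothetical helper: weighted sum of the channels, rescaled to the 0-255 range
to_grey_255 <- function(img) {
  stopifnot(length(dim(img)) == 3, dim(img)[3] == 3)
  grey <- 0.299 * img[, , 1] + 0.587 * img[, , 2] + 0.114 * img[, , 3]
  round(grey * 255)
}

x <- to_grey_255(rgb_img)
print(dim(x))
print(range(x))

## With the package installed, x has the form image_line_segment_detector expects:
## linesegments <- image_line_segment_detector(x)
```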
The algorithm is implemented with Rcpp and is released under the AGPL-3 license. Hope you enjoy it.
Need support in image recognition? Let us know.