crfsuite for natural language processing

A new R package called crfsuite supported by BNOSAC landed safely on CRAN last week. The crfsuite package (https://github.com/bnosac/crfsuite) is an R package specific to Natural Language Processing and allows you to easily build and apply models for

  • named entity recognition
  • text chunking
  • part of speech tagging
  • intent recognition or
  • classification of any category you have in mind

The focus of the implementation is on allowing R users to build such models on their own data, with their own categories. The R package is an Rcpp interface to the popular crfsuite C++ package, which is used a lot in all kinds of chatbots.

To make it easier to create training data from your own text, the R package ships a shiny app which lets you tag your own chunks of text with your own categories. The tagged text can then be used to build a crfsuite model. The package also plays nicely together with the udpipe R package (https://CRAN.R-project.org/package=udpipe), which you need in order to extract predictive features (e.g. parts of speech tags) for your words to be used in the crfsuite model.

On a side note, if you work in NLP, you might also be interested in the upcoming ruimtehol R package, which is a wrapper around the excellent StarSpace C++ code providing word/sentence/document embeddings, text-based classification, content-based recommendation and similarities as well as entity relationship completion.

You can get going with the crfsuite package as follows. Have a look at the package vignette, which shows you how to construct and apply your own crfsuite model.

## Install the packages
install.packages("crfsuite")
install.packages("udpipe")

## Look at the vignette
library(crfsuite)
library(udpipe)
vignette("crfsuite-nlp", package = "crfsuite")
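As a rough illustration of the kind of input such a sequence model consumes, the sketch below builds context features (the previous and next part of speech tag within each document) in base R. The toy token table and the helper names lag1 and lead1 are assumptions made up for this example; the vignette shows the package's own workflow.

```r
## Toy token-level data: one row per token, with a label to learn
tokens <- data.frame(
  doc_id = c(1, 1, 1, 1, 2, 2, 2),
  token  = c("Jan", "works", "in", "Brussels", "CRAN", "hosts", "packages"),
  upos   = c("PROPN", "VERB", "ADP", "PROPN", "PROPN", "VERB", "NOUN"),
  label  = c("B-PER", "O", "O", "B-LOC", "B-ORG", "O", "O"),
  stringsAsFactors = FALSE)

## Hypothetical helpers: shift a column by one position within each document
lag1  <- function(x, group) ave(x, group, FUN = function(v) c(NA, head(v, -1)))
lead1 <- function(x, group) ave(x, group, FUN = function(v) c(tail(v, -1), NA))

## Context features: the tag of the previous and of the next token
tokens$upos_previous <- lag1(tokens$upos, tokens$doc_id)
tokens$upos_next     <- lead1(tokens$upos, tokens$doc_id)
print(tokens)

## The feature columns, the label column and doc_id (as the grouping)
## are the three ingredients a sequence tagger is trained on.
```

In practice the part of speech tags would come from udpipe rather than being typed in by hand.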

More details at the development repository https://github.com/bnosac/crfsuite where you can also provide feedback.

Training on Text Mining 

If you are interested in how text mining techniques work, you might be interested in the following data science courses, which are held in the coming months.

  • 20-21/11/2018: Text mining with R. Leuven (Belgium). Subscribe here
  • 19-20/12/2018: Applied spatial modelling with R. Leuven (Belgium). Subscribe here
  • 21-22/02/2019: Advanced R programming. Leuven (Belgium). Subscribe here
  • 13-14/03/2019: Computer Vision with R and Python. Leuven (Belgium). Subscribe here
  • 15/03/2019: Image Recognition with R and Python. Subscribe here
  • 01-02/04/2019: Text Mining with R. Leuven (Belgium). Subscribe here

Last call for the course on text mining of next week

Last call for the 2-day course on Text Mining with R, held next week (08-09 October 2018) in Brussels, Belgium. Subscribe at https://www.eventbrite.co.uk/e/dsb2018-text-mining-with-r-jan-wijffels-bnosac-session-03-04-tickets-50586501588

During that course you'll learn the following:

  • Cleaning of text data, regular expressions
  • String distances
  • Graphical displays of text data
  • Natural language processing: stemming, parts-of-speech tagging, tokenization, lemmatisation, dependency parsing, noun phrase detection and keyword extraction
  • Entity recognition & chunking using Conditional Random Fields
  • Sentiment analysis
  • Statistical topic detection modelling (Latent Dirichlet Allocation)
  • Visualisation of correlations & topics
  • Automatic classification using predictive modelling based on text data
  • Word and Text embeddings
  • Document similarities & Text alignment

udpipe version 0.7 for Natural Language Processing (#NLP) alongside #tidytext, #quanteda, #tm

This blogpost announces the release of the udpipe R package version 0.7 on CRAN. udpipe is an R package which does tokenization, parts of speech tagging, lemmatization, morphological feature tagging and dependency parsing. Its main feature is that it is a lightweight R package which works on more than 50 languages and gives you rich NLP output out of the box.

The package was updated mainly to make it easier to work with the crfsuite R package, which does entity/intent recognition and chunking. The user-visible changes are that udpipe now has a shorthand for working with text in the TIF format and that it can now indicate the location of each token inside the original text. In addition, version 0.7 caches the udpipe models.

Example

Using udpipe (version >= 0.7) works as follows. First download the model of your language and next do the annotation.

library(udpipe)
udmodel <- udpipe_download_model(language = "dutch")
x <- udpipe("De federale regering besliste vandaag dat er een neutrale verpakking voor roltabak en sigaretten komt",
object = udmodel)

Since version 0.7, you can now also directly indicate the language. This will download the udpipe annotation model if it is not already downloaded. Please inspect the help of udpipe_download_model for more details on the languages available and the license of these.

x <- udpipe("Je veux qu’on me juge pour ce que je suis et non pour ce qu’était mon père", "french")
x <- udpipe("Europa lança taxas sobre navios para tirar lixo do fundo do mar.", "portuguese")
x <- udpipe("आपके इस स्नेह्पूर्ण और जोरदार स्वागत से मेरा हृदय आपार हर्ष से भर गया है। मैं आपको दुनिया के सबसे पौराणिक भिक्षुओं की तरफ से धन्यवाद् देता हूँ। मैं आपको सभी धर्मों की जननी कि तरफ से धन्यवाद् देता हूँ, और मैं आपको सभी जाति-संप्रदाय के लाखों-करोड़ों हिन्दुओं की तरफ से धन्यवाद् देता हूँ। मेरा धन्यवाद् उन वक्ताओं को भी जिन्होंने ने इस मंच से यह कहा कि दुनिया में शहनशीलता का विचार सुदूर पूरब के देशों से फैला है।", "hindi")
x <- udpipe("The economy is weak but the outlook is bright", "english")
x <- udpipe("Maxime y su mujer hicieron que nuestra estancia fuera lo mas comoda posible", "spanish")
x <- udpipe("A félmilliárdos MVM-támogatásból 433 milliót négy nyúlfarknyi imázsvideóra költött", "hungarian")
x <- udpipe("同樣,施力的大小不同,引起的加速度不同,最終的結果也不一樣,亦可以從向量的加成性來看", "chinese")

The result is a data.frame with one row per doc_id and term_id containing all the tokens in the data, the lemma, the part of speech tags, the morphological features, the dependency relationship between the tokens and the location where the token is found in the original text.
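To make the idea of a token location concrete, here is a small base R sketch. It assumes the location is given as 1-based start and end character offsets, and it uses a simple regular expression as a stand-in tokenizer (not udpipe's own tokenizer), so that it runs without downloading a model.

```r
txt <- "The economy is weak but the outlook is bright"

## Stand-in tokenizer: runs of letters/digits (NOT how udpipe tokenizes)
matches <- gregexpr("[[:alnum:]]+", txt)
m <- matches[[1]]

## One row per token with its start/end character position in the text
tok <- data.frame(
  token = regmatches(txt, matches)[[1]],
  start = as.integer(m),
  end   = as.integer(m) + attr(m, "match.length") - 1L,
  stringsAsFactors = FALSE)

## With the offsets you can cut each token back out of the original text
recovered <- substring(txt, tok$start, tok$end)
print(tok)
```

Offsets like these are what make it possible to map a tagged chunk back onto the exact span of the raw text.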

Use alongside other R packages

R has a rich NLP ecosystem. If you want to use udpipe alongside other R packages, here are some basic possibilities showing how to easily extract the lemmas and the text of the parts of speech tags you are interested in.
Below we show how to use udpipe alongside 3 popular R packages (tidytext, quanteda and tm) on the following data.frame in TIF format.

rawdata <- data.frame(doc_id = c("doc1", "doc2"),
                      text = c("The economy is weak but the outlook is bright.",
                               "Natural Language Processing has never been more easy than this."),
                      stringsAsFactors = FALSE)
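For readers unfamiliar with TIF (the Text Interchange Formats), a corpus data.frame in that format has a character column doc_id with unique values as its first column and a character column text as its second. A minimal base R check could look as follows; the helper name looks_like_tif is made up for this sketch and is not part of udpipe.

```r
rawdata <- data.frame(doc_id = c("doc1", "doc2"),
                      text = c("The economy is weak but the outlook is bright.",
                               "Natural Language Processing has never been more easy than this."),
                      stringsAsFactors = FALSE)

## Hypothetical helper: TRUE if x has the shape of a TIF corpus data.frame
looks_like_tif <- function(x) {
  is.data.frame(x) && ncol(x) >= 2 &&
    identical(names(x)[1:2], c("doc_id", "text")) &&  # column order matters
    is.character(x$doc_id) && is.character(x$text) && # both must be character
    !anyDuplicated(x$doc_id)                          # document ids are unique
}

looks_like_tif(rawdata)
```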

Using tidytext

In this code, we let tidytext do tokenisation and use udpipe to enrich the token list. Next we subset the data.frame of tokens by extracting only proper nouns, nouns and adjectives.

library(tidytext)
library(udpipe)
library(dplyr)
x <- unnest_tokens(rawdata, input = text, output = word)
x <- udpipe(split(x$word, x$doc_id), "english")
x <- filter(x, upos %in% c("PROPN", "NOUN", "ADJ"))

Using quanteda

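In the code below, we let udpipe do tokenisation and hand the lemmas back to quanteda as a tokens object.

library(quanteda)
library(udpipe)
## Written against the current quanteda API (version 3 or later)
x <- corpus(rawdata, docid_field = "doc_id", text_field = "text")
## as.character() gives a named character vector; the names become the doc_id
tokens <- udpipe(as.character(x), "english")
toks <- as.tokens(split(tokens$lemma, tokens$doc_id))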

Using tm

Below, we keep only the lemmas of the nouns, proper nouns and adjectives and apply this using the tm_map functionality from tm.

library(tm)
library(udpipe)
x <- VCorpus(VectorSource(rawdata$text))
## content_transformer() keeps every element a valid tm document
x <- tm_map(x, content_transformer(function(txt){
  data <- udpipe(txt, "english")
  data <- subset(data, upos %in% c("PROPN", "NOUN", "ADJ"))
  paste(data$lemma, collapse = " ")
}))

UDPipe already uses deep learning techniques (e.g. a GRU network) for tokenisation, and in 2018 the dependency parsing was enhanced by incorporating TensorFlow. On the roadmap for a next release is the integration of these UDPipe Future enhancements (which took 3rd place in the CoNLL 2018 shared task), including the TensorFlow components.

How to detect hatespeech in plain text #schildnvrienden

Yesterday a pretty controversial Pano TV documentary called 'Wie is Schild & Vrienden echt' aired on the national television channel 'één' (https://www.vrt.be/vrtnu/a-z/pano/2018/pano-s2018a10). The documentary revealed the internal communication of a right-wing group from Belgium, called #schildnvrienden.

After that, there was a show by Van Gils & gasten where a representative of the police explained, or tried not to explain, how the police can or cannot monitor private online groups. It was pretty hilarious how she managed not to say anything about the internal online monitoring system they apparently have.

That reminded me that a few years ago, I created an R package which can easily detect hate speech. I finally put it online on github today. You can find it at https://github.com/weRbelgium/hatespeech.dutch. The R package uses a dictionary made available by the University of Antwerp, which I think is the basis of the hate speech detection algorithms that the police in Belgium are currently running.

Example

How does that hate speech detection system work? It is pretty simple: a dictionary of hate speech terminology and a set of hate speech regular expressions are set up. You then provide some text, which is cut up into words, and the system checks which words are part of the dictionary. As an example, let's try it out below on a message by the leader of that #schildnvrienden group to see if it is considered hate speech.

library(udpipe)
library(hatespeech.dutch)
detect_hatespeech("Europa wordt élke dag geteisterd door geweld van illegalen.
Zowel voor mensen die zich zorgen maken over dit geweld als voor mensen
 die zich zorgen maken over de boze reactie van Europeanen òp dit geweld zou
 oplossing duidelijk moeten zijn: alle illegalen opsporen en deporteren.",
 type = "udpipe")
    Neutral-Country   Neutral-Migration Neutral-Nationality 
                  0                   1                   0 
   Neutral-Religion  Neutral-Skin_color      Racist-Animals 
                  0                   0                   0 
     Racist-Country        Racist-Crime      Racist-Culture 
                  0                   0                   0 
    Racist-Diseases    Racist-Migration  Racist-Nationality 
                  0                   0                   0 
        Racist-Race     Racist-Religion   Racist-Skin_color 
                  0                   0                   0 
 Racist-Stereotypes 
                  0

So apparently the dictionary logic considers this statement as Neutral-Migration. I hope the police have improved on the natural language processing a bit and have incorporated more than just word lookup and regular expressions. Feel free to try the hate speech detector out on your own text using the R package made available at https://github.com/weRbelgium/hatespeech.dutch. Or visit the website to see the dictionaries which are used to detect hate speech.
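To show how little is going on in such a dictionary approach, here is a minimal base R sketch of word lookup plus regular expressions. The tiny dictionary, the pattern and the function name count_categories are all made up for illustration; this is not the University of Antwerp dictionary and not the code of the hatespeech.dutch package.

```r
## Hypothetical mini-dictionary: category -> list of terms
dictionary <- list(
  "Neutral-Migration" = c("illegalen", "migranten", "asielzoekers"),
  "Neutral-Country"   = c("marokko", "turkije"))
## Hypothetical regular expressions, also organised per category
patterns <- c("Neutral-Migration" = "^deport")

count_categories <- function(text, dictionary, patterns) {
  ## Cut the text up into lowercase words
  words <- tolower(unlist(strsplit(text, "[^[:alpha:]]+")))
  words <- words[nzchar(words)]
  ## Per category: exact dictionary hits plus regular expression hits
  sapply(names(dictionary), function(category) {
    hits <- sum(words %in% dictionary[[category]])
    if (category %in% names(patterns)) {
      hits <- hits + sum(grepl(patterns[[category]], words))
    }
    hits
  })
}

result <- count_categories("Alle illegalen opsporen en deporteren.",
                           dictionary, patterns)
print(result)
```

A lookup like this has no notion of context or intent, which is why a statement can land in a neutral category purely because of the words it happens to contain.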

Upcoming public courses on Text mining with R, Statistical machine learning with R, Applied Spatial Modelling with R, Advanced R programming, Computer Vision and Image Recognition

I'm happy to announce that the following list of courses for R users is ready to be booked. All courses are face-to-face courses held in Belgium.

  • 08-09/10/2018: Text mining with R. Brussels (Belgium). See http://di-academy.com/bootcamp
  • 15-16/10/2018: Statistical machine learning with R. Leuven (Belgium). Subscribe here
  • 20-21/11/2018: Text mining with R. Leuven (Belgium). Subscribe here
  • 19-20/12/2018: Applied spatial modelling with R. Leuven (Belgium). Subscribe here
  • 21-22/02/2019: Advanced R programming. Leuven (Belgium). Subscribe here
  • 13-14/03/2019: Computer Vision with R and Python. Leuven (Belgium). Subscribe here
  • 15/03/2019: Image Recognition with R and Python. Subscribe here
  • 01-02/04/2019: Text Mining with R. Leuven (Belgium). Subscribe here

For a list of all R courses which can be given in-house at your site, visit http://www.bnosac.be/index.php/training and get in touch to schedule the course.
