Starspace for NLP #nlproc

Our latest addition to the R ecosystem for NLP is the R package ruimtehol, which is open source and available at https://github.com/bnosac/ruimtehol. The package is a wrapper around Starspace, which provides a neural embedding model for the following text tasks:

Text classification
Learning word, sentence or document level embeddings
Finding sentence or document similarity
Ranking web documents
Content-based recommendation (e.g. recommend text/music based on the content)
Collaborative filtering based recommendation (e.g. recommend text/music based on interest)
Identification of entity relationships

If you are an R user interested in NLP techniques, feel free to test the framework and provide feedback at https://github.com/bnosac/ruimtehol/issues. The package is not on CRAN yet, but can be installed easily with the command devtools::install_github("bnosac/ruimtehol", build_vignettes = TRUE).

Below is an example of how the package can be used for multi-label classification of questions asked in the Belgian parliament. Each question was labelled with one or more of the 1785 categories.

library(ruimtehol)
data(dekamer, package = "ruimtehol")

## Each question in parliament was labelled with more than 1 category. There are 1785 categories in this dataset
dekamer$question_themes <- strsplit(dekamer$question_theme, " +\\| +")
## Plain text of the question in parliament
dekamer$text <- strsplit(dekamer$question, "\\W")
dekamer$text <- sapply(dekamer$text, FUN=function(x) paste(x, collapse = " "))
dekamer$text <- tolower(dekamer$text)

## Build starspace model
model <- embed_tagspace(x = dekamer$text, 
                        y = dekamer$question_themes, 
                        dim = 50, 
                        ngram = 3, loss = "hinge", similarity = "cosine", adagrad = TRUE,
                        early_stopping = 0.8, minCount = 2, 
                        thread = 4)

## Get embeddings of the dictionary of words as well as the categories
embedding_words  <- as.matrix(model, type = "words")
embedding_labels <- as.matrix(model, type = "label")

## Find closest labels / predict
embedding_combination <- starspace_embedding(model, "federale politie patrouille", type = "document")
embedding_similarity(embedding_combination, 
                     embedding_labels, 
                     top_n = 3)

term1                                            term2 similarity rank
federale politie patrouille           __label__POLITIE  0.8480641    1
federale politie patrouille          __label__OPENBARE  0.6919607    2
federale politie patrouille __label__BEROEPSMOBILITEIT  0.6907637    3
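
The ranking above can be read as a nearest-neighbour lookup: the document embedding is compared to every label embedding and the closest labels come first. Below is a minimal base R sketch of that idea, assuming the similarity shown is the cosine similarity (as the similarity = "cosine" training argument suggests). The toy embeddings, the label names and the helper cosine_similarity are made up for illustration; this is not the package's internal implementation.

```r
## Toy embeddings (made-up numbers): 1 document and 3 labels in a 4-dimensional space
doc <- matrix(c(0.9, 0.1, 0.0, 0.2), nrow = 1,
              dimnames = list("toy document", NULL))
labels <- rbind("__label__A" = c( 0.8, 0.2, 0.1, 0.1),
                "__label__B" = c( 0.0, 1.0, 0.0, 0.0),
                "__label__C" = c(-0.5, 0.1, 0.9, 0.0))

## Hypothetical helper: cosine similarity between every row of x and every row of y
cosine_similarity <- function(x, y) {
  x <- x / sqrt(rowSums(x^2))  # scale each row to unit length
  y <- y / sqrt(rowSums(y^2))
  x %*% t(y)                   # dot products of unit vectors are cosines
}

sims <- cosine_similarity(doc, labels)

## Rank the labels from most to least similar and show the top 2
ranking <- data.frame(term1 = rownames(doc),
                      term2 = colnames(sims),
                      similarity = as.numeric(sims),
                      stringsAsFactors = FALSE)
ranking <- ranking[order(ranking$similarity, decreasing = TRUE), ]
ranking$rank <- seq_len(nrow(ranking))
head(ranking, 2)
```

With these toy numbers __label__A ends up first because its vector points in nearly the same direction as the document vector, while __label__C points away from it and gets a negative similarity.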

predict(model, "de migranten komen naar europa, in asielcentra ...")
$input
"de migranten komen naar europa, in asielcentra ..."
$prediction
                label               label_starspace similarity
 VLUCHTELINGENCENTRUM __label__VLUCHTELINGENCENTRUM  0.7075160
          VLUCHTELING          __label__VLUCHTELING  0.6253517
             ILLEGALE             __label__ILLEGALE  0.5997692
       MIGRATIEBELEID       __label__MIGRATIEBELEID  0.5939595
           UITWIJZING           __label__UITWIJZING  0.5376520
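
To see what the data preparation at the top of the example does, the same splitting steps can be run on a single made-up record. The record below is hypothetical and only mimics the question_theme and question columns used above; the last step, which squeezes repeated spaces, is an optional extra that is not part of the original example.

```r
## Made-up record in the same shape as the columns used above (not real dekamer data)
toy <- data.frame(question_theme = "POLITIE | VERKEER | MIGRATIEBELEID",
                  question = "Hoeveel patrouilles reden er in 2017? En in 2018?",
                  stringsAsFactors = FALSE)

## Split the pipe-separated themes: a list with one character vector of labels per question
themes <- strsplit(toy$question_theme, " +\\| +")

## Split on non-word characters, glue the pieces back together with spaces and lowercase
text <- strsplit(toy$question, "\\W")
text <- sapply(text, FUN = function(x) paste(x, collapse = " "))
text <- tolower(text)

## Two adjacent separators (here "? ") leave an empty piece and hence a double space.
## Optional extra step, not in the original example: squeeze repeated spaces
text_clean <- trimws(gsub(" +", " ", text))

themes
text
text_clean
```

The list returned by strsplit keeps all labels of one question together, which is the kind of input passed as y to embed_tagspace in the example above.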

The list of R packages for text mining maintained by BNOSAC has been growing steadily. It currently contains:

udpipe: tokenisation, lemmatisation, parts of speech tagging, dependency parsing, morphological feature extraction, sentiment scoring, keyword extraction, NLP flows
crfsuite: named entity recognition, text classification, chunking, sequence modelling
textrank: text summarisation
ruimtehol: text classification, word/sentence/document embeddings, document/label similarities, ranking documents, content-based recommendation, collaborative filtering-based recommendation

More details on ruimtehol are available at the development repository https://github.com/bnosac/ruimtehol, where you can also provide feedback.

Training on Text Mining

If you are interested in how text mining techniques work, you might be interested in the following data science courses held in the coming months.

19-20/12/2018: Applied spatial modelling with R. Leuven (Belgium).
21-22/02/2019: Advanced R programming. Leuven (Belgium).
13-14/03/2019: Computer Vision with R and Python. Leuven (Belgium).
15/03/2019: Image Recognition with R and Python.
01-02/04/2019: Text Mining with R. Leuven (Belgium).