Starspace for NLP #nlproc
Our recent addition to the NLP R universe is called R package ruimtehol which is open sourced at https://github.com/bnosac/ruimtehol This R package is a wrapper around Starspace which provides a neural embedding model for doing the following on text:
- Text classification
- Learning word, sentence or document level embeddings
- Finding sentence or document similarity
- Ranking web documents
- Content-based recommendation (e.g. recommend text/music based on the content)
- Collaborative filtering based recommendation (e.g. recommend text/music based on interest)
- Identification of entity relationships
If you are an R user and are interested in NLP techniques. Feel free to test out the framework and provide feedback at https://github.com/bnosac/ruimtehol/issues. The package is not on CRAN yet, but can be installed easily with the command devtools::install_github("bnosac/ruimtehol", build_vignettes = TRUE).
Below is an example how the package can be used for multi-label classification on questions asked in Belgian parliament. Each question in parliament was labelled with several of one of the 1785 categories.
library(ruimtehol)
data(dekamer, package = "ruimtehol")
## Each question in parliament was labelled with more than 1 category. There are 1785 categories in this dataset
dekamer$question_themes <- strsplit(dekamer$question_theme, " +\\| +")
## Plain text of the question in parliament
dekamer$text <- strsplit(dekamer$question, "\\W")
dekamer$text <- sapply(dekamer$text, FUN=function(x) paste(x, collapse = " "))
dekamer$text <- tolower(dekamer$text)
## Build starspace model
model <- embed_tagspace(x = dekamer$text,
y = dekamer$question_themes,
dim = 50,
ngram = 3, loss = "hinge", similarity = "cosine", adagrad = TRUE,
early_stopping = 0.8, minCount = 2,
thread = 4)
## Get embeddings of the dictionary of words as well as the categories
embedding_words <- as.matrix(model, type = "words")
embedding_labels <- as.matrix(model, type = "label")
## Find closest labels / predict
embedding_combination <- starspace_embedding(model, "federale politie patrouille", type = "document")
embedding_similarity(embedding_combination,
embedding_labels,
top_n = 3)
term1 term2 similarity rank
federale politie patrouille __label__POLITIE 0.8480641 1
federale politie patrouille __label__OPENBARE 0.6919607 2
federale politie patrouille __label__BEROEPSMOBILITEIT 0.6907637 3
predict(model, "de migranten komen naar europa, in asielcentra ...")
$input
"de migranten komen naar europa, in asielcentra ..."
$prediction
label label_starspace similarity
VLUCHTELINGENCENTRUM __label__VLUCHTELINGENCENTRUM 0.7075160
VLUCHTELING __label__VLUCHTELING 0.6253517
ILLEGALE __label__ILLEGALE 0.5997692
MIGRATIEBELEID __label__MIGRATIEBELEID 0.5939595
UITWIJZING __label__UITWIJZING 0.5376520
The list of R packages regarding text mining with R provided by BNOSAC has been steadily growing. This is the list of R packages maintained by BNOSAC.
- udpipe: tokenisation, lemmatisation, parts of speech tagging, dependency parsing, morphological feature extraction, sentiment scoring, keyword extraction, NLP flows
- crfsuite: named entity recognition, text classification, chunking, sequence modelling
- textrank: text summarisation
- ruimtehol: text classification, word/sentence/document embeddings, document/label similarities, ranking documengs, content based recommendation, collaborative filtering-based recommendation
More details of ruimtehol at the development repository https://github.com/bnosac/ruimtehol where you can also provide feedback.
Training on Text Mining
Are you interested in how text mining techniques work, then you might be interested in the following data science courses that are held in the coming months.
- 19-20/12/2018: Applied spatial modelling with R. Leuven (Belgium). Subscribe here
- 21-22/02/2018: Advanced R programming. Leuven (Belgium). Subscribe here
- 13-14/03/2018: Computer Vision with R and Python. Leuven (Belgium). Subscribe here
- 15/03/2019: Image Recognition with R and Python: Subscribe here
- 01-02/04/2019: Text Mining with R. Leuven (Belgium). Subscribe here