You did a sentiment analysis with tidytext but you forgot to do dependency parsing to answer WHY is something positive/negative

A small note for the growing list of users of the udpipe R package. In the last month of 2018, we updated the package on CRAN with some noticeable changes:

  • The default models downloaded by the function udpipe_download_model are now built on Universal Dependencies 2.3 (released on 2018-11-15).
  • This means udpipe now has models for 60 languages. That's right! And they provide tokenisation, parts of speech tagging, lemmatisation and dependency parsing built on all of these treebanks: afrikaans-afribooms, ancient_greek-perseus, ancient_greek-proiel, arabic-padt, armenian-armtdp, basque-bdt, belarusian-hse, bulgarian-btb, buryat-bdt, catalan-ancora, chinese-gsd, coptic-scriptorium, croatian-set, czech-cac, czech-cltt, czech-fictree, czech-pdt, danish-ddt, dutch-alpino, dutch-lassysmall, english-ewt, english-gum, english-lines, english-partut, estonian-edt, finnish-ftb, finnish-tdt, french-gsd, french-partut, french-sequoia, french-spoken, galician-ctg, galician-treegal, german-gsd, gothic-proiel, greek-gdt, hebrew-htb, hindi-hdtb, hungarian-szeged, indonesian-gsd, irish-idt, italian-isdt, italian-partut, italian-postwita, japanese-gsd, kazakh-ktb, korean-gsd, korean-kaist, kurmanji-mg, latin-ittb, latin-perseus, latin-proiel, latvian-lvtb, lithuanian-hse, maltese-mudt, marathi-ufal, north_sami-giella, norwegian-bokmaal, norwegian-nynorsk, norwegian-nynorsklia, old_church_slavonic-proiel, old_french-srcmf, persian-seraji, polish-lfg, polish-sz, portuguese-bosque, portuguese-br, portuguese-gsd, romanian-nonstandard, romanian-rrt, russian-gsd, russian-syntagrus, russian-taiga, sanskrit-ufal, serbian-set, slovak-snk, slovenian-ssj, slovenian-sst, spanish-ancora, spanish-gsd, swedish-lines, swedish-talbanken, tamil-ttb, telugu-mtg, turkish-imst, ukrainian-iu, upper_sorbian-ufal, urdu-udtb, uyghur-udt, vietnamese-vtb.
  • Although this was not originally intended, we added a sentiment scoring function in the latest release (version 0.8 on CRAN). Combined with the output of the dependency parsing, this allows you to answer questions like 'WHAT IS CAUSING A NEGATIVE SENTIMENT'. An example is shown below.
  • If you want to use the udpipe models for commercial purposes, we have some nice extra pretrained models available for you - get in touch if you are looking for this.

Below we will showcase the new features of the R package by finding out what is causing a negative sentiment.

When I see users of the tidytext R package doing sentiment analysis, I always wonder whether they do sentiment scoring for the love of building reports, as the main thing they seem to report is the frequency of words which are part of a positive or negative dictionary. Meanwhile their manager probably asked them: "Yeah, but why is the sentiment negative or positive?"
You can answer this managerial question using dependency parsing, and that is exactly what udpipe provides (amongst other NLP annotations). Dependency parsing links each word to another word, allowing us to find out which words are linked to negative words. This gives you the context of why something is negative and what needs to be improved in your business. Let's show how to get this easily done in R.
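As a minimal sketch of what "linking each word to another word" means, the base R snippet below uses a tiny hand-typed annotation (not real model output) with the token_id, head_token_id and dep_rel columns that udpipe returns. The one-word dictionary negative_terms is a made-up assumption, purely for illustration.

```r
## Hand-typed toy annotation of "The dirty bathroom was small" (not model output)
anno <- data.frame(
  doc_id        = "doc1",
  sentence_id   = 1L,
  token_id      = c("1", "2", "3", "4", "5"),
  token         = c("The", "dirty", "bathroom", "was", "small"),
  upos          = c("DET", "ADJ", "NOUN", "AUX", "ADJ"),
  head_token_id = c("3", "3", "5", "5", "0"),
  dep_rel       = c("det", "amod", "nsubj", "cop", "root"),
  stringsAsFactors = FALSE
)
negative_terms <- c("dirty")   # hypothetical dictionary with a single entry

## Look up the parent of every token through head_token_id.
## The root has head_token_id "0", so its parent is NA.
## With several sentences this lookup would have to be done per sentence.
parent_idx        <- match(anno$head_token_id, anno$token_id)
anno$token_parent <- anno$token[parent_idx]

## Keep the negative words and show what they are attached to
linked <- subset(anno, token %in% negative_terms,
                 select = c(token, dep_rel, token_parent))
print(linked)
```

In this toy sentence the negative word "dirty" is an adjectival modifier (amod) of "bathroom", which is the kind of link the rest of this post extracts from real reviews.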

Below we take a sample of 500 Airbnb customer reviews in French, annotate them with udpipe (using a French model built on the Rhapsodie French treebank) and apply the new sentiment scoring function txt_sentiment, which is available in the new udpipe release, using an online dictionary of positive / negative terms for French. Next we use the udpipe dependency parsing output by looking at the adjectival modifier 'amod' in the dep_rel udpipe output and visualise all words which are linked to the negative terms of the dictionary. The result is this graph showing words of the dictionary in red and words which are linked to those words in another colour.

sentiment and dependency parsing

The full code showing how this is done is given below.

library(udpipe)
library(dplyr)
library(magrittr)
data(brussels_reviews, package = "udpipe")
x <- brussels_reviews %>%
  filter(language == "fr") %>%
  rename(doc_id = id, text = feedback) %>%
  udpipe("french-spoken", trace = 10)
##
## Get a French sentiment dictionary lexicon with positive/negative terms, negators, amplifiers and deamplifiers
##
load(file("https://github.com/sborms/sentometrics/raw/master/data-raw/FEEL_fr.rda"))
load(file("https://github.com/sborms/sentometrics/raw/master/data-raw/valence-raw/valShifters.rda"))
polarity_terms <- rename(FEEL_fr, term = x, polarity = y)
polarity_negators <- subset(valShifters$valence_fr, t == 1)$x
polarity_amplifiers <- subset(valShifters$valence_fr, t == 2)$x
polarity_deamplifiers <- subset(valShifters$valence_fr, t == 3)$x
##
## Do sentiment analysis based on that open French lexicon
##
sentiments <- txt_sentiment(x, term = "lemma",
                            polarity_terms = polarity_terms,
                            polarity_negators = polarity_negators,
                            polarity_amplifiers = polarity_amplifiers,
                            polarity_deamplifiers = polarity_deamplifiers)
sentiments <- sentiments$data
  • Nothing fancy has happened so far. We use udpipe for NLP annotation (tokenisation, lemmatisation, parts of speech tagging and dependency parsing). The sentiment scoring not only joins the tokens with the sentiment dictionary but also looks at neighbouring words which might change the sentiment.
  • The resulting dataset looks like this

udpipe enriched
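To make the idea of neighbouring words changing the sentiment concrete, here is a rough base R sketch. It is not the algorithm used by txt_sentiment; the function score_tokens, the window size n_before and the tiny dictionary are all assumptions made up for this illustration. It only handles negators and ignores amplifiers and deamplifiers.

```r
## Toy negation-aware scoring: flip the polarity of a dictionary word
## when a negator occurs within the n_before preceding tokens.
score_tokens <- function(tokens, polarity, negators, n_before = 2) {
  scores <- unname(polarity[tokens])   # dictionary lookup, NA if not found
  scores[is.na(scores)] <- 0
  for (i in which(scores != 0)) {
    window <- tail(tokens[seq_len(i - 1)], n_before)
    if (any(window %in% negators)) {
      scores[i] <- -scores[i]
    }
  }
  scores
}

polarity <- c(propre = 1, sale = -1)   # hypothetical mini dictionary
negators <- c("pas")

plain   <- score_tokens(c("appartement", "propre"), polarity, negators)
negated <- score_tokens(c("appartement", "pas", "propre"), polarity, negators)
print(plain)
print(negated)
```

A plain dictionary join would score "propre" as positive in both cases; looking at the preceding tokens is what turns "pas propre" into a negative signal.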

Now we can answer the question - why is something negative

This is done by using the dependency relationship output of udpipe to find out which words are linked to negative words from our sentiment dictionary. Users unfamiliar with dependency relationships can have a look at the definitions of the possible tags for the dep_rel field of the dependency parsing output. In this case we only take 'amod', meaning we are looking for adjectives modifying a noun.

## Use cbind_dependencies to add the parent token to which the keyword is linked
reasons <- sentiments %>%
  cbind_dependencies() %>%
  select(doc_id, lemma, token, upos, sentiment_polarity, token_parent, lemma_parent, upos_parent, dep_rel) %>%
  filter(sentiment_polarity < 0)
head(reasons)
  • Now, instead of making a plot showing which negative words appear (which tidytext users seem to be so keen on), we can make a plot showing the negative words and the words which these negative terms are linked to, indicating the context of the negative term.
  • We select the lemmas of the negative words and the lemma of the parent word and calculate how many times they occur together.
reasons <- filter(reasons, dep_rel %in% "amod")
word_cooccurences <- reasons %>%
  group_by(lemma, lemma_parent) %>%
  summarise(cooc = n()) %>%
  arrange(-cooc)
## tibble() replaces the deprecated data_frame(); both are re-exported by dplyr
vertices <- bind_rows(
  tibble(key = unique(reasons$lemma)) %>% mutate(in_dictionary = if_else(key %in% polarity_terms$term, "in_dictionary", "linked-to")),
  tibble(key = unique(setdiff(reasons$lemma_parent, reasons$lemma))) %>% mutate(in_dictionary = "linked-to"))
  • The following makes the visualisation using ggraph.
library(magrittr)
library(ggraph)
library(igraph)
cooc <- head(word_cooccurences, 20)
set.seed(123456789)
cooc %>%  
  graph_from_data_frame(vertices = filter(vertices, key %in% c(cooc$lemma, cooc$lemma_parent))) %>%
  ggraph(layout = "fr") +
  geom_edge_link0(aes(edge_alpha = cooc, edge_width = cooc)) +
  geom_node_point(aes(colour = in_dictionary), size = 5) +
  geom_node_text(aes(label = name), vjust = 1.8, col = "darkgreen") +
  ggtitle("Which words are linked to the negative terms") +
  theme_void()

This generated the image shown above, showing context of negative terms. Now go do this on your own data.

If you are interested in the techniques shown above, you might also be interested in our recent open-sourced NLP developments:

  • textrank: text summarisation
  • crfsuite: entity recognition, chunking and sequence modelling
  • BTM: biterm topic modelling on short texts (e.g. survey answers / twitter data)
  • ruimtehol: neural text models on top of Starspace (neural models for text categorisation, word/sentence/document embeddings, document recommendation, entity link completion and entity embeddings)
  • udpipe: general NLP package for tokenisation, lemmatisation, parts of speech tagging, morphological annotations, dependency parsing, keyword extraction and NLP flows

Enjoy!

 

Starspace for NLP #nlproc

Our recent addition to the NLP R universe is the R package ruimtehol, which is open sourced at https://github.com/bnosac/ruimtehol. This R package is a wrapper around Starspace, which provides a neural embedding model for doing the following on text:

  • Text classification
  • Learning word, sentence or document level embeddings
  • Finding sentence or document similarity
  • Ranking web documents
  • Content-based recommendation (e.g. recommend text/music based on the content)
  • Collaborative filtering based recommendation (e.g. recommend text/music based on interest)
  • Identification of entity relationships


If you are an R user and are interested in NLP techniques, feel free to test out the framework and provide feedback at https://github.com/bnosac/ruimtehol/issues. The package is not on CRAN yet, but can be installed easily with the command devtools::install_github("bnosac/ruimtehol", build_vignettes = TRUE).

Below is an example of how the package can be used for multi-label classification on questions asked in the Belgian parliament. Each question in parliament was labelled with one or more of the 1785 categories.

library(ruimtehol)
data(dekamer, package = "ruimtehol")

## Each question in parliament was labelled with more than 1 category. There are 1785 categories in this dataset
dekamer$question_themes <- strsplit(dekamer$question_theme, " +\\| +")
## Plain text of the question in parliament
dekamer$text <- strsplit(dekamer$question, "\\W")
dekamer$text <- sapply(dekamer$text, FUN=function(x) paste(x, collapse = " "))
dekamer$text <- tolower(dekamer$text)
## Build starspace model
model <- embed_tagspace(x = dekamer$text,
                        y = dekamer$question_themes,
                        dim = 50,
                        ngram = 3, loss = "hinge", similarity = "cosine", adagrad = TRUE,
                        early_stopping = 0.8, minCount = 2,
                        thread = 4)
## Get embeddings of the dictionary of words as well as the categories
embedding_words  <- as.matrix(model, type = "words")
embedding_labels <- as.matrix(model, type = "label")
## Find closest labels / predict
embedding_combination <- starspace_embedding(model, "federale politie patrouille", type = "document")
embedding_similarity(embedding_combination,
                     embedding_labels,
                     top_n = 3)

                      term1                      term2 similarity rank
federale politie patrouille           __label__POLITIE  0.8480641    1
federale politie patrouille          __label__OPENBARE  0.6919607    2
federale politie patrouille __label__BEROEPSMOBILITEIT  0.6907637    3
predict(model, "de migranten komen naar europa, in asielcentra ...")
$input
"de migranten komen naar europa, in asielcentra ..."
$prediction
                label               label_starspace similarity
 VLUCHTELINGENCENTRUM __label__VLUCHTELINGENCENTRUM  0.7075160
          VLUCHTELING          __label__VLUCHTELING  0.6253517
             ILLEGALE             __label__ILLEGALE  0.5997692
       MIGRATIEBELEID       __label__MIGRATIEBELEID  0.5939595
           UITWIJZING           __label__UITWIJZING  0.5376520
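The model above was trained with similarity = "cosine", so finding the closest labels comes down to ranking label embeddings by cosine similarity to the embedding of the text. The base R sketch below illustrates that ranking step only; the function cosine_top_n, the 2-dimensional vectors and the label names are invented for the example and are not taken from the trained model or from the ruimtehol internals.

```r
## Rank the rows of a label embedding matrix by cosine similarity to a query vector
cosine_top_n <- function(query, labels, top_n = 3) {
  query_norm  <- sqrt(sum(query^2))
  label_norms <- sqrt(rowSums(labels^2))
  similarity  <- as.vector(labels %*% query) / (label_norms * query_norm)
  out <- data.frame(label = rownames(labels), similarity = similarity,
                    stringsAsFactors = FALSE)
  out <- out[order(-out$similarity), ]
  out$rank <- seq_len(nrow(out))
  rownames(out) <- NULL
  head(out, top_n)
}

## Made-up 2-dimensional embeddings, one row per label
labels <- rbind(POLITIE  = c(1, 0),
                MIGRATIE = c(0, 1),
                VERKEER  = c(1, 1))
query  <- c(2, 0.1)   # made-up embedding of a piece of text

closest <- cosine_top_n(query, labels, top_n = 2)
print(closest)
```

Because cosine similarity only looks at the angle between vectors, the length of the query embedding does not influence the ranking.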

The list of R packages regarding text mining with R provided by BNOSAC has been steadily growing. This is the list of R packages maintained by BNOSAC.

  • udpipe: tokenisation, lemmatisation, parts of speech tagging, dependency parsing, morphological feature extraction, sentiment scoring, keyword extraction, NLP flows
  • crfsuite: named entity recognition, text classification, chunking, sequence modelling
  • textrank: text summarisation
  • ruimtehol: text classification, word/sentence/document embeddings, document/label similarities, ranking documents, content based recommendation, collaborative filtering-based recommendation

More details of ruimtehol at the development repository https://github.com/bnosac/ruimtehol where you can also provide feedback.

Training on Text Mining 

Are you interested in how text mining techniques work? Then you might be interested in the following data science courses that are held in the coming months.

  • 19-20/12/2018: Applied spatial modelling with R. Leuven (Belgium). Subscribe here
  • 21-22/02/2019: Advanced R programming. Leuven (Belgium). Subscribe here
  • 13-14/03/2019: Computer Vision with R and Python. Leuven (Belgium). Subscribe here
  • 15/03/2019: Image Recognition with R and Python. Subscribe here
  • 01-02/04/2019: Text Mining with R. Leuven (Belgium). Subscribe here

crfsuite for natural language processing

A new R package called crfsuite supported by BNOSAC landed safely on CRAN last week. The crfsuite package (https://github.com/bnosac/crfsuite) is an R package specific to Natural Language Processing and allows you to easily build and apply models for

  • named entity recognition
  • text chunking
  • part of speech tagging
  • intent recognition or
  • classification of any category you have in mind

The focus of the implementation is on allowing the R user to build such models on their own data, with their own categories. The R package is an Rcpp interface to the popular crfsuite C++ package which is used a lot in all kinds of chatbots.

In order to facilitate creating training data on your own data, a shiny app is made available in this R package which allows you to easily tag your own chunks of text, with your own categories, which can next be used to build a crfsuite model. The package also plays nicely together with the udpipe R package (https://CRAN.R-project.org/package=udpipe), which you need in order to extract predictive features (e.g. parts of speech tags) for your words to be used in the crfsuite model.
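To give an idea of what "predictive features for your words" can look like, the sketch below builds the part of speech tag of the previous and next token as extra columns, using base R only. The annotation is typed by hand in the column layout udpipe returns, and the helpers lag_within and lead_within are hypothetical names for this illustration, not functions from crfsuite or udpipe.

```r
## Hand-typed toy annotation (not model output)
anno <- data.frame(
  doc_id = c("d1", "d1", "d1", "d2", "d2"),
  token  = c("I", "visited", "Brussels", "Nice", "city"),
  upos   = c("PRON", "VERB", "PROPN", "ADJ", "NOUN"),
  stringsAsFactors = FALSE
)

## Shift a column by one position within each document,
## so that features never leak across document boundaries
lag_within  <- function(x, group) ave(x, group, FUN = function(v) c(NA, head(v, -1)))
lead_within <- function(x, group) ave(x, group, FUN = function(v) c(tail(v, -1), NA))

anno$upos_previous <- lag_within(anno$upos, anno$doc_id)
anno$upos_next     <- lead_within(anno$upos, anno$doc_id)
print(anno)
```

Context columns like these are typical inputs for a conditional random field, which predicts the label of a token from the attributes of that token and its neighbours.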

On a side note: if you are in the area of NLP, you might also be interested in the upcoming ruimtehol R package which is a wrapper around the excellent StarSpace C++ code providing word/sentence/document embeddings, text-based classification, content-based recommendation and similarities as well as entity relationship completion.

app screenshot

You can get going with the crfsuite package as follows. Have a look at the package vignette, which shows you how to construct and apply your own crfsuite model.

## Install the packages
install.packages("crfsuite")
install.packages("udpipe")

## Look at the vignette
library(crfsuite)
library(udpipe)
vignette("crfsuite-nlp", package = "crfsuite")

More details at the development repository https://github.com/bnosac/crfsuite where you can also provide feedback.

Training on Text Mining 

Are you interested in how text mining techniques work? Then you might be interested in the following data science courses that are held in the coming months.

  • 20-21/11/2018: Text mining with R. Leuven (Belgium). Subscribe here
  • 19-20/12/2018: Applied spatial modelling with R. Leuven (Belgium). Subscribe here
  • 21-22/02/2019: Advanced R programming. Leuven (Belgium). Subscribe here
  • 13-14/03/2019: Computer Vision with R and Python. Leuven (Belgium). Subscribe here
  • 15/03/2019: Image Recognition with R and Python. Subscribe here
  • 01-02/04/2019: Text Mining with R. Leuven (Belgium). Subscribe here

Last call for the course on text mining of next week

Last call for the 2-day course on Text Mining with R, held next week (08-09 October 2018) in Brussels, Belgium. Subscribe at https://www.eventbrite.co.uk/e/dsb2018-text-mining-with-r-jan-wijffels-bnosac-session-03-04-tickets-50586501588

During that course you will learn the following:

  • Cleaning of text data, regular expressions
  • String distances
  • Graphical displays of text data
  • Natural language processing: stemming, parts-of-speech tagging, tokenization, lemmatisation, dependency parsing, noun phrase detection and keyword extraction
  • Entity recognition & chunking using Conditional Random Fields
  • Sentiment analysis
  • Statistical topic detection modelling (latent dirichlet allocation)
  • Visualisation of correlations & topics
  • Automatic classification using predictive modelling based on text data
  • Word and Text embeddings
  • Document similarities & Text alignment

Training on Text Mining

Are you interested in how text mining techniques work? Then you might be interested in the following data science courses that are held in the coming months.

  • 08-09/10/2018: Text mining with R. Brussels (Belgium). Subscribe at https://www.eventbrite.co.uk/e/dsb2018-text-mining-with-r-jan-wijffels-bnosac-session-03-04-tickets-50586501588
  • 15-16/10/2018: Statistical machine learning with R. Leuven (Belgium). Subscribe here
  • 20-21/11/2018: Text mining with R. Leuven (Belgium). Subscribe here
  • 19-20/12/2018: Applied spatial modelling with R. Leuven (Belgium). Subscribe here
  • 21-22/02/2019: Advanced R programming. Leuven (Belgium). Subscribe here
  • 13-14/03/2019: Computer Vision with R and Python. Leuven (Belgium). Subscribe here
  • 15/03/2019: Image Recognition with R and Python. Subscribe here
  • 01-02/04/2019: Text Mining with R. Leuven (Belgium). Subscribe here

udpipe version 0.7 for Natural Language Processing (#NLP) alongside #tidytext, #quanteda, #tm

This blogpost announces the release of the udpipe R package version 0.7 on CRAN. udpipe is an R package which does tokenization, parts of speech tagging, lemmatization, morphological feature tagging and dependency parsing. Its main feature is that it is a lightweight R package which works on more than 50 languages and gives you rich NLP output out of the box.

The package was updated mainly in order to work more easily with the crfsuite R package, which does entity/intent recognition and chunking. The user-visible changes are that udpipe now has a shorthand for working with text in the TIF format and that it now also indicates the location of each token inside the original text. Next to this, version 0.7 also caches the udpipe models.
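As an illustration of what the location of a token inside the original text means, the base R sketch below finds the character offsets of a list of tokens by scanning the text from left to right. This is only a sketch under the assumption that every token occurs literally in the text; the helper locate_tokens is a hypothetical name and not how udpipe computes its offsets.

```r
## Find 1-based start/end character positions of tokens in the original text
locate_tokens <- function(text, tokens) {
  start <- rep(NA_integer_, length(tokens))
  pos   <- 1L
  for (i in seq_along(tokens)) {
    ## search only in the part of the text which has not been consumed yet
    hit <- as.integer(regexpr(tokens[i], substring(text, pos), fixed = TRUE))
    if (hit < 0) next                 # token not found, leave NA
    start[i] <- pos + hit - 1L
    pos      <- start[i] + nchar(tokens[i])
  }
  data.frame(token = tokens,
             start = start,
             end   = start + nchar(tokens) - 1L,
             stringsAsFactors = FALSE)
}

txt <- "The economy is weak"
loc <- locate_tokens(txt, c("The", "economy", "is", "weak"))
print(loc)

## The offsets allow you to cut the token back out of the raw text
substring(txt, loc$start, loc$end)
```

Offsets like these are what make it possible to map model output, such as a recognised entity, back onto the exact span of the original document.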

Example

Using udpipe (version >= 0.7) works as follows. First download the model of your language and next do the annotation.

library(udpipe)
udmodel <- udpipe_download_model(language = "dutch")
x <- udpipe("De federale regering besliste vandaag dat er een neutrale verpakking voor roltabak en sigaretten komt",
            object = udmodel)

Since version 0.7, you can now also directly indicate the language. This will download the udpipe annotation model if it is not already downloaded. Please inspect the help of udpipe_download_model for more details on the languages available and the license of these.

x <- udpipe("Je veux qu’on me juge pour ce que je suis et non pour ce qu’était mon père", "french")
x <- udpipe("Europa lança taxas sobre navios para tirar lixo do fundo do mar.", "portuguese")
x <- udpipe("आपके इस स्नेह्पूर्ण और जोरदार स्वागत से मेरा हृदय आपार हर्ष से भर गया है। मैं आपको दुनिया के सबसे पौराणिक भिक्षुओं की तरफ से धन्यवाद् देता हूँ। मैं आपको सभी धर्मों की जननी कि तरफ से धन्यवाद् देता हूँ, और मैं आपको सभी जाति-संप्रदाय के लाखों-करोड़ों हिन्दुओं की तरफ से धन्यवाद् देता हूँ। मेरा धन्यवाद् उन वक्ताओं को भी जिन्होंने ने इस मंच से यह कहा कि दुनिया में शहनशीलता का विचार सुदूर पूरब के देशों से फैला है।", "hindi")
x <- udpipe("The economy is weak but the outlook is bright", "english")
x <- udpipe("Maxime y su mujer hicieron que nuestra estancia fuera lo mas comoda posible", "spanish")
x <- udpipe("A félmilliárdos MVM-támogatásból 433 milliót négy nyúlfarknyi imázsvideóra költött", "hungarian")
x <- udpipe("同樣,施力的大小不同,引起的加速度不同,最終的結果也不一樣,亦可以從向量的加成性來看", "chinese")

The result is a data.frame with one row per doc_id and term_id containing all the tokens in the data, the lemma, the part of speech tags, the morphological features, the dependency relationship between the tokens and the location where the token is found in the original text.

screenshot udpipe chinese

Use alongside other R packages

R has a rich NLP ecosystem. If you want to use udpipe alongside other R packages, let's enumerate some basic possibilities which show how to easily extract the lemmas and the text belonging to the parts of speech tags you are interested in.
Below we show how to use udpipe alongside 3 popular R packages (tidytext, quanteda and tm) on the following data.frame in TIF format.

rawdata <- data.frame(doc_id = c("doc1", "doc2"),
                      text = c("The economy is weak but the outlook is bright.",
                               "Natural Language Processing has never been more easy than this."),
                      stringsAsFactors = FALSE)
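For readers unfamiliar with TIF (the Text Interchange Format), a corpus data.frame is expected to have a character column doc_id with unique document identifiers as its first column and a character column text as its second column. The base R check below is a minimal sketch of those rules; is_tif_corpus is a hypothetical helper written for this post, not a function from udpipe.

```r
## Minimal check of the TIF corpus data.frame conventions
is_tif_corpus <- function(x) {
  is.data.frame(x) &&
    ncol(x) >= 2 &&
    identical(names(x)[1:2], c("doc_id", "text")) &&
    is.character(x$doc_id) &&
    is.character(x$text) &&
    anyDuplicated(x$doc_id) == 0
}

reviews <- data.frame(doc_id = c("doc1", "doc2"),
                      text = c("The economy is weak.", "The outlook is bright."),
                      stringsAsFactors = FALSE)
is_tif_corpus(reviews)

## A data.frame with duplicated document identifiers does not qualify
duplicated_ids <- transform(reviews, doc_id = "doc1")
is_tif_corpus(duplicated_ids)
```

Setting stringsAsFactors = FALSE matters here on older R versions, as factor columns would not pass the character check.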

Using tidytext

In this code, we let tidytext do tokenisation and use udpipe to enrich the token list. Next we subset the data.frame of tokens by extracting only proper nouns, nouns and adjectives.

library(tidytext)
library(udpipe)
library(dplyr)
x <- unnest_tokens(rawdata, input = text, output = word)
x <- udpipe(split(x$word, x$doc_id), "english")
x <- filter(x, upos %in% c("PROPN", "NOUN", "ADJ"))

Using quanteda

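In the code below, we let udpipe do tokenisation and provide the lemmas back as a quanteda tokens object.

library(quanteda)
library(udpipe)
## Written against quanteda >= 3, where text_field, as.character() and as.tokens()
## replace the older textField, texts() and as.tokenizedTexts()
x <- corpus(rawdata, text_field = "text")
tokens <- udpipe(as.character(x), "english")
lemmas <- as.tokens(split(tokens$lemma, tokens$doc_id))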

Using tm

Below, we get only the lemmas of the nouns, proper nouns and adjectives and apply this using the tm_map functionality from tm.

library(tm)
library(udpipe)
x <- VCorpus(VectorSource(rawdata$text))
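## content_transformer() hands the document text to the function and stores the result back in the document
x <- tm_map(x, content_transformer(function(txt){
  data <- udpipe(txt, "english")
  data <- subset(data, upos %in% c("PROPN", "NOUN", "ADJ"))
  paste(data$lemma, collapse = " ")
}))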

UDPipe already uses deep learning techniques (e.g. a GRU network) for tokenisation, and in 2018 the dependency parsing was enhanced by incorporating tensorflow. On the roadmap for a next release is the integration of these future UDPipe enhancements (which got 3rd place at the CoNLL 2018 shared task), including the tensorflow components.

Training on Text Mining

Are you interested in how text mining techniques work? Then you might be interested in the following data science courses that are held in the coming months.

  • 08-09/10/2018: Text mining with R. Brussels (Belgium). http://di-academy.com/bootcamp
  • 15-16/10/2018: Statistical machine learning with R. Leuven (Belgium). Subscribe here
  • 20-21/11/2018: Text mining with R. Leuven (Belgium). Subscribe here
  • 19-20/12/2018: Applied spatial modelling with R. Leuven (Belgium). Subscribe here
  • 21-22/02/2019: Advanced R programming. Leuven (Belgium). Subscribe here
  • 13-14/03/2019: Computer Vision with R and Python. Leuven (Belgium). Subscribe here
  • 15/03/2019: Image Recognition with R and Python. Subscribe here
  • 01-02/04/2019: Text Mining with R. Leuven (Belgium). Subscribe here