udpipe R package updated

An update of the udpipe R package (https://bnosac.github.io/udpipe/en) landed safely on CRAN last week. The udpipe R package was originally put on CRAN in 2017 as a wrapper around the UDPipe (v1.2 C++) tokeniser, lemmatiser, parts of speech tagger and dependency parser. It now offers much more functionality than just this parser.

The current release (0.8.4-1 on CRAN: https://cran.r-project.org/package=udpipe) makes sure the default models are the ones trained on version 2.5 of Universal Dependencies. Other features of the release are detailed in the NEWS item. This is what dependency parsing looks like on some sample text.

library(udpipe)
x <- udpipe("The package provides a dependency parsers built on data from universaldependencies.org", "english")
View(x)
library(ggraph)
library(ggplot2)
library(igraph)
library(textplot)
plt <- textplot_dependencyparser(x, size = 4, title = "udpipe R package - dependency parsing")
plt

udpipe parser plot

Over the years, the toolkit has also incorporated many functionalities for common data manipulations on texts which are enriched with the output of the parser, namely functionalities and algorithms for collocations, token co-occurrence, document term matrix handling, term frequency inverse document frequency calculations, information retrieval metrics, handling of multi-word expressions, keyword detection (Rapid Automatic Keyword Extraction, noun phrase extraction, syntactical patterns), sentiment scoring and semantic similarity analysis.
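As a rough illustration of a few of these manipulations, the sketch below computes co-occurrences, a document term matrix and tf-idf weights. It assumes the annotated `brussels_reviews_anno` dataset bundled with udpipe, so no model download is needed; the choice of Dutch reviews and of the `NN`/`JJ` tags is only for the example.

```r
library(udpipe)

## Annotated hotel reviews shipped with udpipe (columns include doc_id, lemma, xpos)
data("brussels_reviews_anno", package = "udpipe")
x <- subset(brussels_reviews_anno, language == "nl" & xpos %in% c("NN", "JJ"))

## How often do two lemmas occur in the same document
cooc <- cooccurrence(x, group = "doc_id", term = "lemma")
head(cooc)

## Document/term frequencies -> sparse document term matrix -> tf-idf per term
freq  <- document_term_frequencies(x[, c("doc_id", "lemma")])
dtm   <- document_term_matrix(freq)
tfidf <- dtm_tfidf(dtm)
head(sort(tfidf, decreasing = TRUE))
```

The same functions work on the data frame returned by `udpipe()` once you swap in your own annotation.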

Many add-on R packages

The udpipe package is loosely coupled with other NLP R packages which BNOSAC released on CRAN in the last 4 years. Loosely coupled means that none of the packages have hard dependencies on one another, which makes them easy to install and maintain and allows you to use only the packages and tools that you want.

Here is a short list of loosely coupled packages by BNOSAC which contain functions and documentation where the udpipe package is used as a preprocessing step.

- BTM: Biterm Topic Modelling
- crfsuite: Build named entity recognition models using conditional random fields
- nametagger: Build named entity recognition models using Markov models
- torch.ner: Named Entity Recognition using torch
- word2vec: Training and applying the word2vec algorithm
- ruimtehol: Text embedding techniques using Starspace
- textrank: Text summarisation and keyword detection using textrank
- brown: Brown word clustering on texts
- sentencepiece: Byte Pair Encoding and Unigram tokenisation using sentencepiece
- tokenizers.bpe: Byte Pair Encoding tokenisation using YouTokenToMe
- text.alignment: Find text similarities using Smith-Waterman
- textplot: Visualise complex relations in texts

textplot example

Model building example

To showcase the loose integration, let's use the udpipe package alongside the word2vec package to build a udpipe model ourselves on the German GSD treebank, which is described at https://universaldependencies.org/treebanks/de_gsd/index.html and contains a set of CC BY-SA licensed annotated texts from news articles, wiki entries and reviews.
More information at https://universaldependencies.org.

Download the treebank.

library(utils)
settings <- list()
settings$ud.train    <- "https://raw.githubusercontent.com/UniversalDependencies/UD_German-GSD/r2.6/de_gsd-ud-train.conllu"
settings$ud.dev      <- "https://raw.githubusercontent.com/UniversalDependencies/UD_German-GSD/r2.6/de_gsd-ud-dev.conllu"
settings$ud.test     <- "https://raw.githubusercontent.com/UniversalDependencies/UD_German-GSD/r2.6/de_gsd-ud-test.conllu"
## Download the conllu files
download.file(url = settings$ud.train, destfile = "train.conllu")
download.file(url = settings$ud.dev,   destfile = "dev.conllu")
download.file(url = settings$ud.test,  destfile = "test.conllu")
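
The downloaded files are in CoNLL-U format: one token per line with 10 tab-separated columns, comment lines starting with `#` and a blank line between sentences. The base R sketch below parses a small hand-written sentence to show that layout. The annotations in it are illustrative and not taken from the treebank; in practice `udpipe_read_conllu()` does this reading for you.

```r
## The 10 CoNLL-U columns, in order
conllu_columns <- c("id", "form", "lemma", "upos", "xpos",
                    "feats", "head", "deprel", "deps", "misc")

## Hand-written example sentence (illustrative annotations only)
conllu_lines <- c(
  "# sent_id = example-1",
  "# text = Die Wissenschaft ist gut.",
  paste("1", "Die", "der", "DET", "ART", "_", "2", "det", "_", "_", sep = "\t"),
  paste("2", "Wissenschaft", "Wissenschaft", "NOUN", "NN", "_", "4", "nsubj", "_", "_", sep = "\t"),
  paste("3", "ist", "sein", "AUX", "VAFIN", "_", "4", "cop", "_", "_", sep = "\t"),
  paste("4", "gut", "gut", "ADJ", "ADJD", "_", "0", "root", "_", "SpaceAfter=No", sep = "\t"),
  paste("5", ".", ".", "PUNCT", "$.", "_", "4", "punct", "_", "_", sep = "\t"),
  ""
)

## Simplification: treating '#' as a comment character would also drop a literal '#' token
tokens <- read.delim(text = conllu_lines, header = FALSE, sep = "\t", quote = "",
                     comment.char = "#", col.names = conllu_columns,
                     stringsAsFactors = FALSE)

## head == 0 marks the root of the dependency tree
tokens[, c("id", "form", "upos", "head", "deprel")]
root <- tokens$form[tokens$head == 0]
```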

Build a word2vec model using our R package word2vec

  • Create word vectors on the downloaded training dataset, as these are used for training the dependency parser
  • Save the word vectors to disk
  • Briefly inspect the word2vec model by showing similarities to some German words

library(udpipe)
library(word2vec)
txt <- udpipe_read_conllu("train.conllu")
txt <- paste.data.frame(txt, term = "token", group = c("doc_id", "paragraph_id", "sentence_id"), collapse = " ")
txt <- txt$token
w2v <- word2vec(txt, type = "skip-gram", dim = 50, window = 10, min_count = 2, negative = 5, iter = 15, threads = 1)
write.word2vec(w2v, file = "wordvectors.vec", type = "txt", encoding = "UTF-8")
predict(w2v, c("gut", "freundlich"), type = "nearest", top_n = 20)
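
To give an intuition for what a "nearest words" lookup does, here is a minimal base R sketch which ranks words by cosine similarity. The three-dimensional vectors are made up for the example, `cosine_nearest` is a hypothetical helper, and using cosine similarity as the ranking measure is an assumption about the package internals rather than something stated above.

```r
## Toy embedding matrix: one row per word (made-up numbers, not trained vectors)
embeddings <- rbind(
  gut        = c( 1.0, 0.2, 0.0),
  freundlich = c( 0.9, 0.3, 0.1),
  schlecht   = c(-1.0, 0.1, 0.0),
  Haus       = c( 0.0, 0.0, 1.0)
)

## Hypothetical helper: rank all other words by cosine similarity to `word`
cosine_nearest <- function(embeddings, word, top_n = 3) {
  unit <- embeddings / sqrt(rowSums(embeddings^2))  ## scale every row to length 1
  sims <- drop(unit %*% unit[word, ])               ## dot products of unit vectors
  sims <- sims[names(sims) != word]                 ## leave out the word itself
  head(sort(sims, decreasing = TRUE), top_n)
}

nearest <- cosine_nearest(embeddings, "gut")
nearest
```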

And train the model

  • Using the hyperparameters for the tokeniser, parts of speech tagger & lemmatiser and the dependency parser as shown here: https://github.com/bnosac/udpipe/tree/master/inst/models-ud-2.5
  • Note that model training takes a while (8 hours up to 3 days) depending on the size of the treebank and your hyperparameter settings. This example was run on a Windows laptop with a 1.7 GHz i5 CPU, so no GPU is needed, which keeps this model building process accessible for anyone with a simple PC.

print(Sys.time())
m <- udpipe_train(file = "de_gsd-ud-2.6-20200924.udpipe",
                  files_conllu_training = "train.conllu",
                  files_conllu_holdout  = "dev.conllu",
                  annotation_tokenizer = list(dimension = 64, epochs = 100, segment_size = 200, initialization_range = 0.1,
                                              batch_size = 50, learning_rate = 0.002, learning_rate_final = 0, dropout = 0.1,
                                              early_stopping = 1),
                  annotation_tagger = list(models = 2,
                                           templates_1 = "lemmatizer",
                                           guesser_suffix_rules_1 = 8, guesser_enrich_dictionary_1 = 4,
                                           guesser_prefixes_max_1 = 4,
                                           use_lemma_1 = 1, provide_lemma_1 = 1, use_xpostag_1 = 0, provide_xpostag_1 = 0,
                                           use_feats_1 = 0, provide_feats_1 = 0, prune_features_1 = 1,
                                           templates_2 = "tagger",
                                           guesser_suffix_rules_2 = 8, guesser_enrich_dictionary_2 = 4,
                                           guesser_prefixes_max_2 = 0,
                                           use_lemma_2 = 1, provide_lemma_2 = 0, use_xpostag_2 = 1, provide_xpostag_2 = 1,
                                           use_feats_2 = 1, provide_feats_2 = 1, prune_features_2 = 1),
                  annotation_parser = list(iterations = 30,
                                           embedding_upostag = 20, embedding_feats = 20, embedding_xpostag = 0,
                                           embedding_form = 50, embedding_form_file = "wordvectors.vec",
                                           embedding_lemma = 0, embedding_deprel = 20, learning_rate = 0.01,
                                           learning_rate_final = 0.001, l2 = 0.5, hidden_layer = 200,
                                           batch_size = 10, transition_system = "projective", transition_oracle = "dynamic",
                                           structured_interval = 8))
print(Sys.time())

You can see the logs of this run here. Now that your model is ready, you can use it on your own terms and start annotating your text.

model <- udpipe_load_model("de_gsd-ud-2.6-20200924.udpipe")
texts <- data.frame(doc_id = c("doc1", "doc2"), text = c("Die Wissenschaft ist das beste, was wir haben.", "Von dort war Kraftstoff in das Erdreich gesickert."), stringsAsFactors = FALSE)
anno <- udpipe(texts, model, trace = 10)
View(anno)

udpipe parser table
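
Before relying on the model, it can be worth scoring it against the held-out test file which was downloaded earlier but not used during training. The sketch below wraps `udpipe_accuracy()` in a hypothetical helper, `evaluate_model`, which simply skips the evaluation when the model or the test file is not on disk. The file names are the ones used above.

```r
## Hypothetical helper: evaluate a trained udpipe model on a CoNLL-U test file
evaluate_model <- function(model_file, test_file) {
  if (!file.exists(model_file) || !file.exists(test_file)) {
    message("Model or test file not found, skipping evaluation")
    return(invisible(NULL))
  }
  model <- udpipe::udpipe_load_model(model_file)
  ## Evaluate tokenisation, tagging and parsing together
  udpipe::udpipe_accuracy(model, test_file,
                          tokenizer = "default", tagger = "default", parser = "default")
}

metrics <- evaluate_model("de_gsd-ud-2.6-20200924.udpipe", "test.conllu")
if (!is.null(metrics)) print(metrics)
```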

Enjoy!

Thanks to Slav Petrov, Wolfgang Seeker, Ryan McDonald, Joakim Nivre, Daniel Zeman and Adriane Boyd for creating and distributing the UD_German-GSD treebank, and to the UDPipe authors, in particular Milan Straka.

finding contour lines

Finally, the R package you have all been waiting for has arrived: image.ContourDetector, developed at https://github.com/bnosac/image. It detects contour lines in images using the 'Unsupervised Smooth Contour Detection' algorithm available at http://www.ipol.im/pub/art/2016/175.

Have you always wanted to be able to draw like you are in art school? Let me show how to quickly do this.

example contourlines

If you want to reproduce this, the following snippets show how. The steps are as follows:

1. Install the packages from CRAN

install.packages("image.ContourDetector")
install.packages("magick")
install.packages("sp")

2. Get an image, convert it to greyscale, pass the pixels to the function and off you go.

library(magick)
library(image.ContourDetector)
library(sp)
img <- image_read("https://cdn.mos.cms.futurecdn.net/9sUwFGNJvviJks7jNQ7AWc-1200-80.jpg")
mat <- image_data(img, channels = "gray")
mat <- as.integer(mat, transpose = TRUE)
mat <- drop(mat)
contourlines <- image_contour_detector(mat)
plt <- plot(contourlines)
class(plt)

example contourlines linesonly

3. If you want to have the same image as shown at the top of the article:

Put the 3 images (original, combined, contour lines only) together in 1 plot using the excellent magick R package:

plt <- image_graph(width = image_info(img)$width, height = image_info(img)$height)
plot(contourlines)
dev.off()
plt_combined <- image_graph(width = image_info(img)$width, height = image_info(img)$height)
plot(img)
plot(contourlines, add = TRUE, col = "red", lwd = 5)
dev.off()
combi <- image_append(c(img, plt_combined, plt))
combi
image_write(combi, "example-contourlines.png", format = "png")

Text Plots

A few weeks ago, we pushed the R package textplot to CRAN and it was accepted for release last week. The package contains straightforward functionalities for the visualisation of text, namely of

  • text cooccurrences
  • text clusters (in casu biterm clusters)
  • dependency parsing results
  • text correlations and text frequencies

Some examples of these plots are shown in the gif.

examples textplot

More details can be found in the pdf presentation shown below.


Enjoy.

Biterm topic modelling for short texts

A few weeks ago, we published an update of the BTM (Biterm Topic Models for text) package on CRAN.

Biterm Topic Models are especially useful if you want to find topics in collections of short texts. Short texts are typically a Twitter message, a short answer on a survey, the title of an email, search questions and so on. For these types of short texts, traditional topic models like Latent Dirichlet Allocation are less suited, as most information is available in short word combinations. The R package BTM finds topics in such short texts by explicitly modelling word-word co-occurrences (biterms) in a short window.
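
To make the notion of a biterm concrete, the base R sketch below lists all word pairs of a short text which lie within a given distance of each other. `extract_biterms` is a hypothetical helper written for this illustration, and its window definition (at most `window` positions apart) is an assumption which may differ from how the BTM package counts its window.

```r
## Hypothetical helper: all pairs of tokens at most `window` positions apart
extract_biterms <- function(tokens, window = 3) {
  n <- length(tokens)
  if (n < 2) {
    return(data.frame(term1 = character(0), term2 = character(0), stringsAsFactors = FALSE))
  }
  pairs <- lapply(seq_len(n - 1), function(i) {
    j <- seq(i + 1, min(i + window, n))  ## positions of the partner tokens
    data.frame(term1 = tokens[i], term2 = tokens[j], stringsAsFactors = FALSE)
  })
  do.call(rbind, pairs)
}

## A short "search query" style text
biterm_example <- extract_biterms(c("cheap", "flight", "paris", "hotel"), window = 2)
biterm_example
```

With a window of 2, every word is paired with the next two words, so this 4-word text yields 5 biterms; the topic model then clusters such pairs instead of whole documents.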

The update which was pushed to CRAN a few weeks ago now allows you to explicitly provide a set of biterms to cluster upon. Let us show an example on clustering a subset of R package descriptions on CRAN. The resulting cluster visualisation looks like this.

biterm topic model example

If you want to reproduce this, the following snippets show how. The steps are as follows:

1. Get some data of R packages and their description in plain text

## Get list of packages in the NLP/Machine Learning Task Views
library(ctv)
pkgs <- available.views()
names(pkgs) <- sapply(pkgs, FUN=function(x) x$name)
pkgs <- c(pkgs$NaturalLanguageProcessing$packagelist$name, pkgs$MachineLearning$packagelist$name)

## Get package descriptions of these packages
library(tools)
x <- CRAN_package_db()
x <- x[, c("Package", "Title", "Description")]
x$doc_id <- x$Package
x$text   <- tolower(paste(x$Title, x$Description, sep = "\n"))
x$text   <- gsub("'", "", x$text)
x$text   <- gsub("<[^>]+>", "", x$text)  ## strip each <...> reference (URL, DOI) separately
x <- subset(x, Package %in% pkgs)

2. Use the udpipe R package to perform Parts of Speech tagging on the package title and descriptions and use udpipe as well for extracting cooccurrences of nouns, adjectives and verbs within 3 words distance.

library(udpipe)
library(data.table)
library(stopwords)
anno <- udpipe(x, "english", trace = 10)
biterms <- as.data.table(anno)
biterms <- biterms[, cooccurrence(x = lemma,
                                  relevant = upos %in% c("NOUN", "ADJ", "VERB") &
                                             nchar(lemma) > 2 & !lemma %in% stopwords("en"),
                                  skipgram = 3),
                   by = list(doc_id)]

3. Build the biterm topic model with 9 topics and provide the set of biterms to cluster upon

library(BTM)
set.seed(123456)
traindata <- subset(anno, upos %in% c("NOUN", "ADJ", "VERB") & !lemma %in% stopwords("en") & nchar(lemma) > 2)
traindata <- traindata[, c("doc_id", "lemma")]
model     <- BTM(traindata, biterms = biterms, k = 9, iter = 2000, background = TRUE, trace = 100)
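
Once a model is fitted, you will typically want to look at the most probable terms per topic and score documents against the topics. The sketch below does this on a small stand-in model, fitted on the Dutch reviews bundled with udpipe so that it runs quickly and without the CRAN download. The number of topics and iterations are arbitrary choices for the example, and the same `terms()` and `predict()` calls should apply to the `model` object built above.

```r
library(udpipe)
library(BTM)

## Small stand-in dataset: nouns from the Dutch reviews shipped with udpipe
data("brussels_reviews_anno", package = "udpipe")
x <- subset(brussels_reviews_anno, language == "nl" & xpos %in% c("NN", "NNP", "NNS"))
x <- x[, c("doc_id", "lemma")]

## Quick toy fit: 3 topics and only a few Gibbs iterations
set.seed(123456)
toy_model <- BTM(x, k = 3, iter = 10, trace = FALSE)

## Most probable tokens per topic (a list with one data.frame per topic)
top_terms <- terms(toy_model, top_n = 5)
top_terms

## Topic scores per document (one column per topic)
scores <- predict(toy_model, newdata = x)
head(scores)
```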

4. Visualise the biterm topic clusters using the textplot package available at https://github.com/bnosac/textplot. This creates the plot shown above.

library(textplot)
library(ggraph)
plot(model, top_n = 10,
     title = "BTM model", subtitle = "R packages in the NLP/Machine Learning task views",
     labels = c("Garbage", "Neural Nets / Deep Learning", "Topic modelling",
                "Regression/Classification Trees/Forests", "Gradient Descent/Boosting",
                "GLM/GAM/Penalised Models", "NLP / Tokenisation",
                "Text Mining Frameworks / API's", "Variable Selection in High Dimensions"))

 

Enjoy!