Natural Language Processing on 40 languages with the Ripple Down Rules-based Part-Of-Speech Tagger

Part-of-speech (POS) tagging is a crucial step in natural language processing. It consists of labelling each word in a text document with a category such as noun, verb, adverb or pronoun. At BNOSAC, we use it on a daily basis, for example to select only nouns before doing topic detection, or in specific NLP flows. For R users working with different languages, the number of POS tagging options is small and each has its upsides and downsides. The following taggers are commonly used.

  • The Stanford Part-Of-Speech Tagger, which is terribly slow and whose language set is limited to English/French/German/Spanish/Arabic/Chinese (no Dutch). R packages for this are available at http://datacube.wu.ac.at.
  • Treetagger (http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger) covers more languages but is only usable for non-commercial purposes (it can be used through the koRpus R package).
  • OpenNLP is faster and supports POS tagging for Dutch, Spanish, Polish, Swedish, English, Danish and German, but not for French or Eastern European languages. R packages for this are available at http://datacube.wu.ac.at.
  • Package pattern.nlp (https://github.com/bnosac/pattern.nlp) offers POS tagging and lemmatisation for Dutch, French, English, German, Spanish and Italian, but needs Python installed, which is not always easy to request from IT departments.
  • SyntaxNet and Parsey McParseface (https://github.com/tensorflow/models/tree/master/syntaxnet) have good accuracy for POS tagging but need tensorflow installed, which might be too much installation hassle in a corporate setting, not to mention the computational resources needed.

Enter RDRPOSTagger, which BNOSAC released at https://github.com/bnosac/RDRPOSTagger. It has the following features:


  1. Easily installable in a corporate environment as a simple R package based on rJava
  2. Covering more than 40 languages:
    UniversalPOS annotation for languages: Ancient_Greek, Ancient_Greek-PROIEL, Arabic, Basque, Bulgarian, Catalan, Chinese, Croatian, Czech, Czech-CAC, Czech-CLTT, Danish, Dutch, Dutch-LassySmall, English, English-LinES, Estonian, Finnish, Finnish-FTB, French, Galician, German, Gothic, Greek, Hebrew, Hindi, Hungarian, Indonesian, Irish, Italian, Kazakh, Latin, Latin-ITTB, Latin-PROIEL, Latvian, Norwegian, Old_Church_Slavonic, Persian, Polish, Portuguese, Portuguese-BR, Romanian, Russian-SynTagRus, Slovenian, Slovenian-SST, Spanish, Spanish-AnCora, Swedish, Swedish-LinES, Tamil, Turkish. Prepend UD_ to the language name if you want to use these models (for example, UD_Dutch).
    MORPH annotation for languages: Bulgarian, Czech, Dutch, French, German, Portuguese, Spanish, Swedish
    POS annotation for languages: English, French, German, Hindi, Italian, Thai, Vietnamese
  3. Fast tagging, as the Single Classification Ripple Down Rules are cheap to execute and hence quick on larger text volumes
  4. Competitive accuracy in comparison to state-of-the-art POS and morphological taggers
  5. Cross-platform running on Windows/Linux/Mac
  6. Supports morphological tagging, POS tagging and universal POS tagging of sentences
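
To give a feel for why Single Classification Ripple Down Rules are cheap to execute, here is a minimal sketch in base R. It is not the RDRPOSTagger implementation: the node() and classify() helpers and the three rules are invented for illustration. The idea is that each node carries a condition and a conclusion, with an "except" child that is followed when the condition holds and an "if-not" child that is followed when it does not. The conclusion of the last rule that fired wins, and the default is to keep the initial tag.

```r
# Hypothetical helper: a rule node with a condition, a conclusion (tag)
# and two optional children.
node <- function(condition, tag, except = NULL, if_not = NULL) {
  list(condition = condition, tag = tag, except = except, if_not = if_not)
}

# Toy rules, invented for this example (not taken from a trained model).
# ctx holds the initial tag of a word plus the tags of its neighbours.
tree <- node(
  condition = function(ctx) ctx$init_tag == "VERB" && ctx$prev_tag == "DET",
  tag = "NOUN",
  # exception to the rule above: refine it when the next word is a noun
  except = node(
    condition = function(ctx) ctx$next_tag == "NOUN",
    tag = "ADJ"
  ),
  # tried only when the first condition does not hold
  if_not = node(
    condition = function(ctx) ctx$init_tag == "NOUN" && ctx$prev_tag == "PRON",
    tag = "VERB"
  )
)

# Walk the tree: one condition check per visited node.
classify <- function(tree, ctx) {
  result <- ctx$init_tag            # default rule: keep the initial tag
  current <- tree
  while (!is.null(current)) {
    if (current$condition(ctx)) {
      result <- current$tag         # this rule fired, remember its conclusion
      current <- current$except     # then look for an exception to it
    } else {
      current <- current$if_not     # otherwise try the alternative rule
    }
  }
  result
}

classify(tree, list(init_tag = "VERB", prev_tag = "DET", next_tag = "NOUN"))
classify(tree, list(init_tag = "ADJ", prev_tag = "DET", next_tag = "NOUN"))
```

Tagging a word only requires following a single path through the tree, which is what keeps this kind of model fast on large volumes of text.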

The Ripple Down Rules are basic binary classification trees which are built on top of the Universal Dependencies datasets available at http://universaldependencies.org. The methodology is explained in detail in the paper 'A Robust Transformation-Based Learning Approach Using Ripple Down Rules for Part-Of-Speech Tagging', available at http://content.iospress.com/articles/ai-communications/aic698. If you just want to apply POS tagging to your text, you can proceed as follows:

library(RDRPOSTagger)
rdr_available_models()

## POS annotation
x <- c("Oleg Borisovich Kulik is a Ukrainian-born Russian performance artist")
tagger <- rdr_model(language = "English", annotation = "POS")
rdr_pos(tagger, x = x)

## MORPH/POS annotation
x <- c("Dus godvermehoeren met pus in alle puisten , zei die schele van Van Bukburg .",
       "Er was toen dat liedje van tietenkonttieten kont tieten kontkontkont",
       "  ", "", NA)
tagger <- rdr_model(language = "Dutch", annotation = "MORPH")
rdr_pos(tagger, x = x)

## Universal POS tagging annotation
tagger <- rdr_model(language = "UD_Dutch", annotation = "UniversalPOS")
rdr_pos(tagger, x = x)

## This gives the following output
 sentence.id word.id             word word.type
           1       1              Dus       ADV
           1       2   godvermehoeren      VERB
           1       3              met       ADP
           1       4              pus      NOUN
           1       5               in       ADP
           1       6             alle      PRON
           1       7          puisten      NOUN
           1       8                ,     PUNCT
           1       9              zei      VERB
           1      10              die      PRON
           1      11           schele       ADJ
           1      12              van       ADP
           1      13              Van     PROPN
           1      14          Bukburg     PROPN
           1      15                .     PUNCT
           2       1               Er       ADV
           2       2              was       AUX
           2       3             toen     SCONJ
           2       4              dat     SCONJ
           2       5           liedje      NOUN
           2       6              van       ADP
           2       7 tietenkonttieten      VERB
           2       8             kont     PROPN
           2       9           tieten      VERB
           2      10     kontkontkont     PROPN
           2      11                .     PUNCT
           3       0             <NA>      <NA>
           4       0             <NA>      <NA>
           5       0             <NA>      <NA>
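
As mentioned at the start, a typical next step is to keep only the nouns before doing topic detection. The sketch below assumes the result is a data.frame with the columns shown in the output above (sentence.id, word.id, word, word.type). To keep it runnable without the tagger, a few of those rows are rebuilt by hand instead of calling rdr_pos.

```r
# Hand-built copy of a few rows of the output above (assumed to be a data.frame)
tagged <- data.frame(
  sentence.id = c(1, 1, 1, 1, 1, 1, 1, 3),
  word.id     = c(1, 2, 3, 4, 5, 6, 7, 0),
  word        = c("Dus", "godvermehoeren", "met", "pus", "in", "alle", "puisten", NA),
  word.type   = c("ADV", "VERB", "ADP", "NOUN", "ADP", "PRON", "NOUN", NA),
  stringsAsFactors = FALSE
)

# %in% returns FALSE for NA, so empty sentences drop out automatically
nouns <- tagged[tagged$word.type %in% "NOUN", ]
nouns$word
```

Replace "NOUN" with c("NOUN", "PROPN") if proper nouns should be kept as well.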

The function rdr_pos expects a vector of sentences as input. If you need to split your text data into sentences, just use tokenize_sentences from the tokenizers package.
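
A minimal sketch of that sentence-splitting step is shown below. The example text is made up, and the base R fallback (a crude split on whitespace after sentence-ending punctuation) is only there so the snippet also runs when tokenizers is not installed; it is not a substitute for a real sentence tokenizer.

```r
text <- "POS tagging labels each word. It is a common first step in text mining."

if (requireNamespace("tokenizers", quietly = TRUE)) {
  # tokenize_sentences returns a list with one element per document
  sentences <- unlist(tokenizers::tokenize_sentences(text))
} else {
  # crude fallback: split on whitespace that follows ., ! or ?
  sentences <- unlist(strsplit(text, "(?<=[.!?])\\s+", perl = TRUE))
}

sentences
```

The resulting character vector can then be passed as the x argument of rdr_pos.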

Good luck with text mining.
If you need our help with a text mining project, let us know; we'll be glad to get you started.