Natural Language Processing on 40 languages with the Ripple Down Rules-based Part-Of-Speech Tagger
Part-of-speech (POS) tagging is a crucial part of natural language processing. It consists of labelling each word in a text document with a category such as noun, verb, adverb or pronoun. At BNOSAC, we use it on a daily basis, for example to select only nouns before we do topic detection, or in specific NLP flows. For R users working with different languages, the number of POS tagging options is small and each has its upsides and downsides. The following taggers are commonly used.
- The Stanford Part-Of-Speech Tagger, which is terribly slow and whose language set is limited to English/French/German/Spanish/Arabic/Chinese (no Dutch). R packages for this are available at http://datacube.wu.ac.at.
- Treetagger (http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger) covers more languages but is only usable for non-commercial purposes (it can be used from R through the koRpus package).
- OpenNLP is faster and supports POS tagging for Dutch, Spanish, Polish, Swedish, English, Danish and German, but not for French or Eastern-European languages. R packages for this are available at http://datacube.wu.ac.at.
- Package pattern.nlp (https://github.com/bnosac/pattern.nlp) allows POS tagging and lemmatisation for Dutch, French, English, German, Spanish and Italian, but needs Python installed, which is not always easy to request from IT departments.
- SyntaxNet and Parsey McParseface (https://github.com/tensorflow/models/tree/master/syntaxnet) have good accuracy for POS tagging but need tensorflow installed, which might be too much installation hassle in a corporate setting, not to mention the computational resources needed.
Enter RDRPOSTagger, which BNOSAC released at https://github.com/bnosac/RDRPOSTagger. It has the following features:
- Easily installable in a corporate environment as a simple R package based on rJava
- Covering more than 40 languages:
UniversalPOS annotation for languages: Ancient_Greek, Ancient_Greek-PROIEL, Arabic, Basque, Bulgarian, Catalan, Chinese, Croatian, Czech, Czech-CAC, Czech-CLTT, Danish, Dutch, Dutch-LassySmall, English, English-LinES, Estonian, Finnish, Finnish-FTB, French, Galician, German, Gothic, Greek, Hebrew, Hindi, Hungarian, Indonesian, Irish, Italian, Kazakh, Latin, Latin-ITTB, Latin-PROIEL, Latvian, Norwegian, Old_Church_Slavonic, Persian, Polish, Portuguese, Portuguese-BR, Romanian, Russian-SynTagRus, Slovenian, Slovenian-SST, Spanish, Spanish-AnCora, Swedish, Swedish-LinES, Tamil, Turkish. Prepend UD_ to the language name if you want to use these models.
MORPH annotation for languages: Bulgarian, Czech, Dutch, French, German, Portuguese, Spanish, Swedish
POS annotation for languages: English, French, German, Hindi, Italian, Thai, Vietnamese
- Fast tagging as the Single Classification Ripple Down Rules are easy to execute and hence are quick on larger text volumes
- Competitive accuracy in comparison to state-of-the-art POS and morphological taggers
- Cross-platform running on Windows/Linux/Mac
- Supports morphological tagging, POS tagging and universal POS tagging of sentences
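To give a feel for why Single Classification Ripple Down Rules are cheap to execute, here is a minimal sketch in base R. It is not the code or the rule set shipped with RDRPOSTagger: the helpers `node` and `scrdr_classify`, the toy rules and the `word`/`prev_word` context fields are all made up for illustration. Each node holds a condition and a conclusion; when the condition fires we follow the "except" branch, otherwise the "if-not" branch, and the conclusion of the last rule that fired wins.

```r
# Hypothetical helper: build one rule node of a binary ripple-down-rules tree.
node <- function(condition, conclusion, except = NULL, if_not = NULL) {
  list(condition = condition, conclusion = conclusion,
       except = except, if_not = if_not)
}

# Hypothetical helper: walk the tree and keep the conclusion of the
# last rule whose condition was satisfied.
scrdr_classify <- function(tree, context) {
  conclusion <- NA_character_
  current <- tree
  while (!is.null(current)) {
    if (isTRUE(current$condition(context))) {
      conclusion <- current$conclusion
      current <- current$except   # look for an exception to this rule
    } else {
      current <- current$if_not   # try the alternative rule
    }
  }
  conclusion
}

# Toy rules (illustrative only): the default is NOUN, refined by exceptions.
tree <- node(function(ctx) TRUE, "NOUN",
  except = node(function(ctx) ctx$word %in% c("the", "a"), "DET",
    if_not = node(function(ctx) identical(ctx$prev_word, "to"), "VERB",
      if_not = node(function(ctx) grepl("ly$", ctx$word), "ADV"))))

tag_the     <- scrdr_classify(tree, list(word = "the", prev_word = NA))
tag_run     <- scrdr_classify(tree, list(word = "run", prev_word = "to"))
tag_quickly <- scrdr_classify(tree, list(word = "quickly", prev_word = "ran"))
tag_dog     <- scrdr_classify(tree, list(word = "dog", prev_word = "the"))
```

Tagging a word only takes a single walk down one path of the tree, which is why this kind of model stays quick on larger text volumes.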
The Ripple Down Rules are basic binary classification trees which are built on top of the Universal Dependencies datasets available at http://universaldependencies.org. The methodology is explained in detail in the paper 'A Robust Transformation-Based Learning Approach Using Ripple Down Rules for Part-Of-Speech Tagging', available at http://content.iospress.com/articles/ai-communications/aic698. If you just want to apply POS tagging on your text, you can go ahead as follows:
library(RDRPOSTagger)
rdr_available_models()
## POS annotation
x <- c("Oleg Borisovich Kulik is a Ukrainian-born Russian performance artist")
tagger <- rdr_model(language = "English", annotation = "POS")
rdr_pos(tagger, x = x)
## MORPH/POS annotation
x <- c("Dus godvermehoeren met pus in alle puisten , zei die schele van Van Bukburg .",
"Er was toen dat liedje van tietenkonttieten kont tieten kontkontkont",
" ", "", NA)
tagger <- rdr_model(language = "Dutch", annotation = "MORPH")
rdr_pos(tagger, x = x)
## Universal POS tagging annotation
tagger <- rdr_model(language = "UD_Dutch", annotation = "UniversalPOS")
rdr_pos(tagger, x = x)
## This gives the following output
sentence.id word.id word word.type
1 1 Dus ADV
1 2 godvermehoeren VERB
1 3 met ADP
1 4 pus NOUN
1 5 in ADP
1 6 alle PRON
1 7 puisten NOUN
1 8 , PUNCT
1 9 zei VERB
1 10 die PRON
1 11 schele ADJ
1 12 van ADP
1 13 Van PROPN
1 14 Bukburg PROPN
1 15 . PUNCT
2 1 Er ADV
2 2 was AUX
2 3 toen SCONJ
2 4 dat SCONJ
2 5 liedje NOUN
2 6 van ADP
2 7 tietenkonttieten VERB
2 8 kont PROPN
2 9 tieten VERB
2 10 kontkontkont PROPN
2 11 . PUNCT
3 0 <NA> <NA>
4 0 <NA> <NA>
5 0 <NA> <NA>
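As mentioned in the introduction, a typical next step is to keep only the nouns, for example before topic detection. The sketch below assumes the tagger output is a data.frame with the columns shown above; to stay self-contained it rebuilds the first rows of that output by hand instead of calling the tagger.

```r
# Hand-copied excerpt of the output above (first 7 tokens of sentence 1).
tagged <- data.frame(
  sentence.id = rep(1, 7),
  word.id     = 1:7,
  word        = c("Dus", "godvermehoeren", "met", "pus", "in", "alle", "puisten"),
  word.type   = c("ADV", "VERB", "ADP", "NOUN", "ADP", "PRON", "NOUN"),
  stringsAsFactors = FALSE
)

# Keep only the tokens tagged as nouns.
nouns <- tagged[tagged$word.type %in% "NOUN", ]

# Collapse the remaining words back into one string per sentence.
nouns_per_sentence <- tapply(nouns$word, nouns$sentence.id, paste, collapse = " ")
```

Adding "PROPN" to the set of kept tags would retain proper nouns as well.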
The function rdr_pos expects a vector of sentences as input. If you need to split your text data into sentences, just use tokenize_sentences from the tokenizers package.
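A minimal sketch of that preprocessing step, assuming the tokenizers package is installed. The example text is made up; tokenize_sentences returns a list with one element per input document, so it is flattened to a plain character vector here.

```r
library(tokenizers)

# Made-up example document containing two sentences.
txt <- "POS tagging labels each word with a category. Nouns are often kept for topic detection."

# tokenize_sentences() returns a list; unlist() gives one sentence per element.
sentences <- unlist(tokenize_sentences(txt))

# The resulting vector can then be passed on, e.g. rdr_pos(tagger, x = sentences)
```

Note that unlisting drops the link between sentences and their source document; keep the list structure if you need to map tags back to documents.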
Good luck with text mining.
If you need our help with a text mining project, let us know. We'll be glad to get you started.