A tagset is a list of part-of-speech tags (POS tags for short), i.e. labels used to indicate the part of speech and sometimes also other grammatical categories (case, tense etc.) of each token in a text corpus.

RDRPOSTagger Khmer part-of-speech tagset

This Khmer part-of-speech tagset is used in Khmer corpora annotated by RDRPOSTagger (A Ripple Down Rules-based Part-Of-Speech Tagger), a language-independent tagging toolkit.

Khmer corpora

See the Khmer corpora available in Sketch Engine.


Khmer part-of-speech tagset legend

The following table shows a list of Khmer part-of-speech tags available in Khmer corpora tagged by RDRPOSTagger.

Example of a tag in the CQL concordance search box: [tag="NN"] finds all nouns, e.g. ច្បាប់, វប្បធម៌. (Note: make sure that you use straight double quotation marks, not curly ones.)
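As a minimal sketch of the quotation-mark point above, the following Python snippet builds a CQL token expression for a tag and replaces curly double quotes with straight ones. The function names `tag_query` and `straighten_quotes` are hypothetical helpers for illustration only; they are not part of Sketch Engine or RDRPOSTagger.

```python
# Hypothetical helpers, not part of Sketch Engine or RDRPOSTagger.

# Map common curly double quotation marks to the straight one that CQL expects.
CURLY_TO_STRAIGHT = str.maketrans({"\u201c": '"', "\u201d": '"', "\u201e": '"'})


def tag_query(tag: str) -> str:
    """Build a CQL token expression that matches a single POS tag."""
    if '"' in tag or "\\" in tag:
        raise ValueError(f"unexpected character in tag: {tag!r}")
    return f'[tag="{tag}"]'


def straighten_quotes(query: str) -> str:
    """Replace curly double quotes (e.g. pasted from a word processor)."""
    return query.translate(CURLY_TO_STRAIGHT)


if __name__ == "__main__":
    print(tag_query("NN"))
    print(straighten_quotes("[tag=\u201cVB\u201d]"))
```

The resulting string can be pasted into the CQL search box as is.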

| PoS Tag | Description | Example |
|---|---|---|
| AB | abbreviation | នៅស. |
| AUX | auxiliary verb | មាន + Verb |
| CC | conjunction | បើ |
| CUR | currency | |
| CD | cardinal number | លាន |
| DBL | double sign | |
| DT | determiner | សព្វ |
| ETC | et cetera | ។ល។ |
| IN | preposition, subordinating conjunction | ដល់ |
| JJ | adjective | ប្លែក |
| KAN | full stop | ។, ៕ |
| M | measure word | នាក់ |
| NN | noun | វប្បធម៌ |
| PA | particle | នូវ |
| PN | proper noun | ភ្នំពេញ |
| PRO | pronoun | គាត់ |
| QT | question word | តើ |
| RB | adverb | ហើយ |
| RPN | relative pronoun | ដែល |
| SYM | symbol | . ” , |
| UH | interjection | ប្លែក |
| VB | verb | ស្តាប់ |
| VB_JJ | adjective from verb | យោង |
| VCOM | verb complement | សល់ |

source: https://github.com/ye-kyaw-thu/khPOS/blob/master/README.md
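For working with the tagset outside the search interface, the legend can be kept as a simple lookup table. The sketch below assumes that tagged text is stored as whitespace-separated `word/TAG` tokens; that layout is an assumption made for illustration and should be checked against your own data. The names `KHMER_TAGS`, `split_token` and `tag_counts` are hypothetical, and the sample line is an artificial string built from the examples in the legend, not a real sentence.

```python
from collections import Counter

# Tag descriptions copied from the legend above.
KHMER_TAGS = {
    "AB": "abbreviation", "AUX": "auxiliary verb", "CC": "conjunction",
    "CUR": "currency", "CD": "cardinal number", "DBL": "double sign",
    "DT": "determiner", "ETC": "et cetera",
    "IN": "preposition, subordinating conjunction", "JJ": "adjective",
    "KAN": "full stop", "M": "measure word", "NN": "noun", "PA": "particle",
    "PN": "proper noun", "PRO": "pronoun", "QT": "question word",
    "RB": "adverb", "RPN": "relative pronoun", "SYM": "symbol",
    "UH": "interjection", "VB": "verb", "VB_JJ": "adjective from verb",
    "VCOM": "verb complement",
}


def split_token(token: str) -> tuple[str, str]:
    """Split an assumed 'word/TAG' token at its last slash and validate the tag."""
    word, sep, tag = token.rpartition("/")
    if not sep or not word:
        raise ValueError(f"not a word/TAG token: {token!r}")
    if tag not in KHMER_TAGS:
        raise ValueError(f"unknown tag {tag!r} in token {token!r}")
    return word, tag


def tag_counts(line: str) -> Counter:
    """Count how often each tag occurs in one whitespace-separated line."""
    return Counter(split_token(token)[1] for token in line.split())


if __name__ == "__main__":
    # Artificial sample assembled from the legend's example words.
    sample = "វប្បធម៌/NN ភ្នំពេញ/PN ។/KAN"
    for tag, count in tag_counts(sample).items():
        print(tag, KHMER_TAGS[tag], count)
```

Splitting at the last slash keeps words that themselves contain a slash intact, and rejecting unknown tags makes typos in the annotation visible early.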


Dat Quoc Nguyen, Dai Quoc Nguyen, Dang Duc Pham and Son Bao Pham. RDRPOSTagger: A Ripple Down Rules-based Part-Of-Speech Tagger. In Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2014, pp. 17-20, 2014.