A tagset is a list of part-of-speech tags (POS tags for short), i.e. labels used to indicate the part of speech and sometimes also other grammatical categories (case, tense, etc.) of each token in a text corpus.
This part-of-speech tagset for Indian languages such as Bengali, Hindi, Kannada and Telugu was created within the Indian Language Machine Translation (ILMT) project, which covers various Indian languages.
An example of a tag in the CQL concordance search box: [tag="NN.*|NST"] finds all nouns, including nouns denoting spatial or temporal expressions, e.g. ಮೇಲೆ, ಬಗ್ಗೆ (note: please make sure that you use straight double quotation marks).
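The way the pattern in the example selects tags can be illustrated outside the search box. The following is a minimal sketch, assuming that the value in the CQL query is a regular expression that must match the whole tag; Python's `re.fullmatch` is used here only as an approximation of that behaviour, and the names `NOUN_PATTERN`, `is_noun_tag` and `sample_tags` are hypothetical.

```python
import re

# Pattern copied from the CQL example [tag="NN.*|NST"].
# Assumption: the whole tag value must match, approximated with re.fullmatch.
NOUN_PATTERN = re.compile(r"NN.*|NST")


def is_noun_tag(tag: str) -> bool:
    """Return True if the complete tag matches the noun pattern."""
    return NOUN_PATTERN.fullmatch(tag) is not None


# A few main PoS tags from the tagset table, used as sample input.
sample_tags = ["NN", "NNP", "NST", "VM", "PSP", "JJ"]
matching = [tag for tag in sample_tags if is_noun_tag(tag)]
print(matching)
```

The `NN.*` alternative covers both NN and NNP (and any longer tag starting with NN), while NST has to be listed separately because it does not start with NN.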
Tagset
PoS Tag | Description | Note/Example |
---|---|---|
CC | Conjunction (co-ordinating and subordinating) | bole (Bangla) |
CL | Classifier | |
DEM | Demonstrative | |
ECH | Echo word | |
INJ | Interjection | |
INTF | Intensifier | |
JJ | Adjective | |
NEG | Negation | |
NN | Noun | |
NNP | Proper noun | |
NST | Noun denoting spatial or temporal expressions | |
PRP | Pronoun | |
PSP | Postposition | |
QC | Cardinal number | |
QF | Quantifier | bahut, tho.DA, kam (Hindi) |
QO | Ordinal number | |
RB | Adverb | Only manner adverbs |
RDP | Reduplication | |
RP | Particle | bhI, to, hI, jI, hA.N, na |
SYM | Special symbol | |
UNK | Unknown | |
UT | Quotative | ani (Telugu), endru (Tamil), bole/mAne (Bangla), mhaNaje (Marathi), mAne (Hindi) |
VAUX | Verb Auxiliary | |
VM | Verb Main | |
WQ | Question Word | |
*C (XC) | Compound | where X is a variable for the type of compound of which the current word is a member |
Source: crawled from Wayback Machine at http://ltrc.iiit.ac.in/tr031/posguidelines.pdf
Hindi part-of-speech tagset scheme in detail
Each PoS tag consists of the main PoS tag written in capital letters (e.g. NN – noun) followed by five further categories, separated by dots, that provide detailed information about the particular token. Unused categories are replaced with a dot (e.g. NNP.unk.… – proper noun unknown).
For example, a noun tag NN.n.m.sg.3.d consists of the following categories and their values.
category | value | description |
---|---|---|
main PoS tag | NN | noun |
coarse PoS tag | n | noun |
gender | m | masculine |
number | sg | singular |
person | 3 | the third person |
case | d | direct |
To find all possible main PoS tags, see the list above.
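The layout of a fully specified tag such as NN.n.m.sg.3.d can be sketched in code. This is a minimal illustration, not part of the tagger itself: the names `HindiTag` and `parse_tag` are hypothetical, and the sketch assumes a tag in which all six fields are present. Tags with unused categories are deliberately not handled, because their exact encoding is not spelled out here.

```python
from typing import NamedTuple


class HindiTag(NamedTuple):
    """The six dot-separated fields of a fully specified tag (hypothetical helper)."""
    main: str    # main PoS tag, e.g. NN
    coarse: str  # coarse PoS tag, e.g. n
    gender: str  # e.g. m
    number: str  # e.g. sg
    person: str  # e.g. 3
    case: str    # e.g. d


def parse_tag(tag: str) -> HindiTag:
    """Split a fully specified tag into its six fields.

    Assumption: every category carries an explicit value, so splitting
    on dots yields exactly six parts.
    """
    parts = tag.split(".")
    if len(parts) != 6:
        raise ValueError(f"expected 6 dot-separated fields, got {len(parts)}: {tag!r}")
    return HindiTag(*parts)


parsed = parse_tag("NN.n.m.sg.3.d")
print(parsed.main, parsed.gender, parsed.case)
```

A named tuple keeps the field order of the tag while letting each category be read by name.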
The list of coarse POS tags follows:
value | description |
---|---|
adj | adjective |
adv | adverb |
avy | avyaya – indeclinable words and some function words, e.g. या |
n | noun |
num | numeral |
pn | pronoun |
psp | postposition |
punc | punctuation |
unk | unknown |
v | verb |
The list of values of the gender category:
value | description |
---|---|
any | any gender |
f | feminine |
m | masculine |
n | neuter |
punc | punctuation |
. | not applicable |
The list of values of the number category:
value | description |
---|---|
any | any number |
pl | plural |
sg | singular |
. | not applicable |
The list of values of the person category:
value | description |
---|---|
any | any person |
1 | the first person |
2 | the second person |
2h | the second person honorific |
3 | the third person |
. | not applicable |
The list of values of the case category:
value | description |
---|---|
any | any case |
d | direct |
o | oblique |
. | not applicable |
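The value lists above can be collected into a small lookup for checking annotations. The following is a sketch under stated assumptions: the names `ALLOWED` and `invalid_categories` are hypothetical, the input is a mapping from category name to value that has already been extracted from a tag, and the sets simply mirror the tables on this page.

```python
# Allowed values per category, copied from the tables above.
# "." stands for "not applicable" where the tables list it.
ALLOWED = {
    "coarse": {"adj", "adv", "avy", "n", "num", "pn", "psp", "punc", "unk", "v"},
    "gender": {"any", "f", "m", "n", "punc", "."},
    "number": {"any", "pl", "sg", "."},
    "person": {"any", "1", "2", "2h", "3", "."},
    "case": {"any", "d", "o", "."},
}


def invalid_categories(values: dict[str, str]) -> list[str]:
    """Return the names of categories whose value is missing or not listed."""
    return [
        name
        for name, allowed in ALLOWED.items()
        if values.get(name) not in allowed
    ]


# Values of the example tag NN.n.m.sg.3.d.
example = {"coarse": "n", "gender": "m", "number": "sg", "person": "3", "case": "d"}
print(invalid_categories(example))
```

An empty list means every category holds a listed value; a value outside the tables, or a missing category, is reported by name.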
Source: https://bitbucket.org/sivareddyg/hindi-part-of-speech-tagger/src/master/README.md