A tagset is a list of part-of-speech tags (POS tags for short), i.e. labels used to indicate the part of speech and sometimes also other grammatical categories (case, tense, etc.) of each token in a text corpus.

This part-of-speech tagset for Indian languages such as Bengali, Hindi, Kannada and Telugu was created as part of the Indian Language Machine Translation (ILMT) project, which covers various Indian languages.

Example of using tags in the CQL concordance search box: [tag="NN.*|NST"] finds all nouns, e.g. ಮೇಲೆ, ಬಗ್ಗೆ. (Note: make sure that you use straight double quotation marks.)
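To see which tags a pattern such as NN.*|NST selects, the matching can be approximated outside Sketch Engine. The following is a minimal Python sketch, assuming that the regular expression in a CQL attribute has to match the whole tag value; Python's `re` module stands in for the CQL regex engine here and may differ from it in details. The names `TAGS` and `matching_tags` are illustrative only and are not part of Sketch Engine.

```python
import re

# Tag names copied from the tagset on this page.
TAGS = ["CC", "CL", "DEM", "ECH", "INJ", "INTF", "JJ", "NEG", "NN", "NNP",
        "NST", "PRP", "PSP", "QC", "QF", "QO", "RB", "RDP", "RP", "SYM",
        "UNK", "UT", "VAUX", "VM", "WQ"]


def matching_tags(pattern, tags=TAGS):
    """Return the tags that the pattern matches in full (hypothetical helper)."""
    compiled = re.compile(pattern)
    # fullmatch mimics the assumed whole-value matching of CQL attributes.
    return [tag for tag in tags if compiled.fullmatch(tag)]


print(matching_tags("NN.*|NST"))  # ['NN', 'NNP', 'NST']
```

Under that assumption, the pattern covers NN, NNP and NST, while tags such as NEG are left out because they do not match in full.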


PoS Tag  Description                                    Note/Example
CC       Conjunction (co-ordinating and subordinating)  bole (Bangla)
CL       Classifier
DEM      Demonstrative
ECH      Echo word
INJ      Interjection
INTF     Intensifier
JJ       Adjective
NEG      Negation
NN       Noun
NNP      Proper noun
NST      Noun denoting spatial or temporal expressions
PRP      Pronoun
PSP      Postposition
QC       Cardinal number
QF       Quantifier                                     bahut, tho.DA, kam (Hindi)
QO       Ordinal number
RB       Adverb                                         Only manner adverbs
RDP      Reduplication
RP       Particle                                       bhI, to, hI, jI, hA.N, na
SYM      Special symbol
UNK      Unknown
UT       Quotative                                      ani (Telugu), endru (Tamil), bole/mAne (Bangla), mhaNaje (Marathi), mAne (Hindi)
VAUX     Verb Auxiliary
VM       Verb Main
WQ       Question Word
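
When a query needs several tags from the table at once, the CQL expression can be assembled programmatically. The following is a minimal Python sketch under the assumption that the positional attribute is called tag, as in the example query on this page. `TAGSET` and `cql_for_tags` are hypothetical names introduced for illustration, not part of any Sketch Engine API.

```python
# Tag names and descriptions copied from the table above.
TAGSET = {
    "CC": "Conjunction (co-ordinating and subordinating)",
    "CL": "Classifier",
    "DEM": "Demonstrative",
    "ECH": "Echo word",
    "INJ": "Interjection",
    "INTF": "Intensifier",
    "JJ": "Adjective",
    "NEG": "Negation",
    "NN": "Noun",
    "NNP": "Proper noun",
    "NST": "Noun denoting spatial or temporal expressions",
    "PRP": "Pronoun",
    "PSP": "Postposition",
    "QC": "Cardinal number",
    "QF": "Quantifier",
    "QO": "Ordinal number",
    "RB": "Adverb",
    "RDP": "Reduplication",
    "RP": "Particle",
    "SYM": "Special symbol",
    "UNK": "Unknown",
    "UT": "Quotative",
    "VAUX": "Verb Auxiliary",
    "VM": "Verb Main",
    "WQ": "Question Word",
}


def cql_for_tags(*tags):
    """Build a one-token CQL expression matching any of the given tags."""
    if not tags:
        raise ValueError("at least one tag is required")
    unknown = [tag for tag in tags if tag not in TAGSET]
    if unknown:
        raise ValueError(f"not in the tagset: {unknown}")
    # Straight double quotation marks are used around the pattern.
    return '[tag="' + "|".join(tags) + '"]'


print(cql_for_tags("VM", "VAUX"))  # [tag="VM|VAUX"]
```

Checking the tag names against the table first means a mistyped tag raises an error instead of silently producing a query that finds nothing.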

Source: retrieved via the Wayback Machine from http://ltrc.iiit.ac.in/tr031/posguidelines.pdf

Corpora of Indian languages

Sketch Engine offers dozens of corpora of Indian languages.