Tagset for Indian Languages (Bengali, Hindi, ...)

A tagset is a list of part-of-speech tags (POS tags for short), i.e. labels used to indicate the part of speech and sometimes also other grammatical categories (case, tense, etc.) of each token in a text corpus.

This part-of-speech tagset for Indian languages such as Bengali, Hindi, Kannada and Telugu was created within the Indian Language Machine Translation (ILMT) project, which covers various Indian languages.

An example of a tag in the CQL concordance search box: [tag="NN.*|NST"] finds all nouns, e.g. ಮೇಲೆ, ಬಗ್ಗೆ. (Note: make sure that you use straight double quotation marks.)
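
The following minimal sketch illustrates which tags the pattern NN.*|NST selects. It assumes that a CQL attribute value is a regular expression that must match the whole tag, which Python's re.fullmatch emulates here. The helper name matches_cql_tag and the sample tag list are hypothetical and only serve the illustration; the sketch does not query Sketch Engine itself.

```python
import re


def matches_cql_tag(pattern: str, tag: str) -> bool:
    """Emulate a CQL attribute test: the regex must cover the whole tag."""
    return re.fullmatch(pattern, tag) is not None


# Pattern taken from the example above (without the surrounding [tag="..."]).
pattern = "NN.*|NST"

# Hypothetical sample of tags; the values come from the tagset on this page.
tags = ["NN", "NNP", "NST", "PRP", "VM", "NN.n.m.sg.3.d"]

noun_tags = [t for t in tags if matches_cql_tag(pattern, t)]
print(noun_tags)
```

NN.* covers NN, NNP and any detailed tag that starts with NN, while the alternative NST adds the spatial and temporal nouns.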

Tagset

PoS Tag	Description	Note/Example
CC	Conjunction (co-ordinating and subordinating)	bole (Bangla)
CL	Classifier
DEM	Demonstrative
ECH	Echo word
INJ	Interjection
INTF	Intensifier
JJ	Adjective
NEG	Negation
NN	Noun
NNP	Proper noun
NST	Noun denoting spatial or temporal expressions
PRP	Pronoun
PSP	Postposition
QC	Cardinal number
QF	Quantifier	bahut, tho.DA, kam (Hindi)
QO	Ordinal number
RB	Adverb	*Only manner adverbs
RDP	Reduplication
RP	Particle	bhI, to, hI, jI, hA.N, na
SYM	Special symbol
UNK	Unknown
UT	Quotative	ani (Telugu), endru (Tamil), bole/mAne (Bangla), mhaNaje (Marathi), mAne (Hindi)
VAUX	Verb Auxiliary
VM	Verb Main
WQ	Question Word
*C (XC)	Compound	where X is a variable standing for the type of the compound of which the current word is a member

Source: crawled from Wayback Machine at http://ltrc.iiit.ac.in/tr031/posguidelines.pdf

Hindi part-of-speech tagset scheme in detail

Each PoS tag is composed of the main PoS tag written in capital letters (e.g. NN – noun) and five further categories, separated by dots, that provide detailed information about the particular token. Unused categories are replaced with a dot (e.g. NNP.unk.… – proper noun unknown).

For example, a noun tag NN.n.m.sg.3.d consists of the following categories and their values.

category	value	description
main PoS tag	NN	noun
coarse PoS tag	n	noun
gender	m	masculine
number	sg	singular
person	3	the third person
case	d	direct
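
The decomposition in the table above can be sketched in Python as follows. This is only an illustration for fully specified tags such as NN.n.m.sg.3.d; the names HindiTag and parse_full_tag are hypothetical. Tags with unused categories are deliberately rejected, because the exact way they are written out is not assumed here.

```python
from typing import NamedTuple


class HindiTag(NamedTuple):
    """The six parts of a detailed Hindi tag, in the order used on this page."""
    main: str
    coarse: str
    gender: str
    number: str
    person: str
    case: str


def parse_full_tag(tag: str) -> HindiTag:
    """Split a fully specified tag into its six dot-separated parts."""
    parts = tag.split(".")
    # Assumption: only tags with all six parts filled in are handled.
    if len(parts) != 6 or not all(parts):
        raise ValueError(f"not a fully specified tag: {tag!r}")
    return HindiTag(*parts)


parsed = parse_full_tag("NN.n.m.sg.3.d")
print(parsed.main, parsed.gender, parsed.case)
```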

To find all possible main PoS tags, see the list above.

The list of coarse POS tags follows:

value	description
adj	adjective
adv	adverb
avy	avyaya – indeclinable and some functional words, e.g. या
n	noun
num	numeral
pn	pronoun
psp	postposition
punc	punctuation
unk	unknown
v	verb

The list of values of the gender category:

value	description
any	any gender
f	feminine
m	masculine
n	neuter
punc	punctuation
.	not applicable

The list of values of the number category:

value	description
any	any number
pl	plural
sg	singular
.	not applicable

The list of values of the person category:

value	description
any	any person
1	the first person
2	the second person
2h	the second person honorific
3	the third person
.	not applicable

The list of values of the case category:

value	description
any	any case
d	direct
o	oblique
.	not applicable
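
As a rough sketch, the value lists above can be turned into a simple validity check. The name ALLOWED and the function invalid_categories are hypothetical; the sets merely restate the values listed on this page, and the check takes category values that are already separated rather than parsing a tag string.

```python
from typing import Dict, List

# Allowed values per category, copied from the lists above.
ALLOWED: Dict[str, set] = {
    "coarse": {"adj", "adv", "avy", "n", "num", "pn", "psp", "punc", "unk", "v"},
    "gender": {"any", "f", "m", "n", "punc", "."},
    "number": {"any", "pl", "sg", "."},
    "person": {"any", "1", "2", "2h", "3", "."},
    "case": {"any", "d", "o", "."},
}


def invalid_categories(values: Dict[str, str]) -> List[str]:
    """Return the names of categories whose value is missing or not listed."""
    return [name for name in ALLOWED if values.get(name) not in ALLOWED[name]]


good = {"coarse": "n", "gender": "m", "number": "sg", "person": "3", "case": "d"}
bad = {"coarse": "n", "gender": "x", "number": "sg", "person": "4", "case": "d"}

print(invalid_categories(good))
print(invalid_categories(bad))
```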

Source: https://bitbucket.org/sivareddyg/hindi-part-of-speech-tagger/src/master/README.md
