HindiWaC: Hindi Corpus from the web
The Hindi Web corpus (HindiWaC) is a Hindi corpus made up of texts collected from the Internet. This corpus contains more than 100 million words crawled from the Hindi Internet during 2009, 2011 and 2014.
Texts in the corpus are lemmatized and morphologically tagged. The corpus has a word sketch grammar that enables users to explore the grammatical and collocational behavior of Hindi words. The whole process of corpus preparation is described in the Corpus factory method document (Kilgarriff et al. at LREC 2010).
The corpus contains a special attribute, cpos, which is a coarse POS tag that is not derived from the attribute tag.
See the Hindi part-of-speech tagset for a description of the POS tags used in the corpus.