HindiWaC: Hindi Corpus from the web
The Hindi Web corpus (HindiWaC) is a Hindi corpus made up of texts collected from the Internet. This corpus contains more than 100 million words crawled from the Hindi Internet during 2012.
Texts in the corpus are lemmatized and morphologically tagged. The corpus has a word sketch grammar that enables users to explore the grammatical and collocational behavior of Hindi words. The whole process of corpus preparation is described in the Corpus Factory method document (Kilgarriff et al. at LREC 2010).
See the Hindi part-of-speech tagset describing POS tags used in the corpus.
Special positional attributes in the 3rd version of the corpus
- cpos – coarse POS tag that is not derived from the attribute tag, see more in section 4.1 of the tagset description (below)
Attributes only in the 3rd version of the corpus
- hlemma/hword (heuristic) – attributes in which all the vowels are stripped so that only the consonants remain. Most spelling variations are due to the use of different vowels, so hlemma and hword come in handy for finding similarly spelt words, e.g. ka (क) + e -> ki की
- Tags with the suffix “:?” mark words which cannot be classified into the target tag linguistically but had to be classified that way due to the context
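
The vowel-stripping idea behind hlemma/hword can be illustrated with a minimal Python sketch. This is an assumption-laden approximation, not the procedure actually used to build the corpus: the function names `strip_vowels` and `group_variants` are hypothetical, and "vowel" is taken here to mean the Devanagari independent vowels (U+0904 to U+0914) and dependent vowel signs (U+093E to U+094C) only. Other marks such as anusvara, nukta and virama are left untouched.

```python
from collections import defaultdict

# Assumption: these two Unicode ranges approximate "all the vowels".
# The corpus's real heuristic may treat more (or fewer) characters as vowels.
INDEPENDENT_VOWELS = {chr(c) for c in range(0x0904, 0x0915)}  # U+0904..U+0914
VOWEL_SIGNS = {chr(c) for c in range(0x093E, 0x094D)}         # U+093E..U+094C
VOWELS = INDEPENDENT_VOWELS | VOWEL_SIGNS


def strip_vowels(word: str) -> str:
    """Return the word with Devanagari vowels removed (hypothetical hword-like key)."""
    return "".join(ch for ch in word if ch not in VOWELS)


def group_variants(words):
    """Group words that share the same consonant skeleton."""
    groups = defaultdict(list)
    for word in words:
        groups[strip_vowels(word)].append(word)
    return dict(groups)


if __name__ == "__main__":
    # Words differing only in their vowels end up under the same key.
    for key, variants in group_variants(["की", "कि", "का", "पर"]).items():
        print(key, variants)
```

Words that differ only in their vowels fall into the same group, which is the property that makes such a key useful for finding similarly spelt words. The key is deliberately lossy, so unrelated words with the same consonants are grouped together as well.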