HindiWaC: Hindi Corpus from the web

The Hindi Web corpus (HindiWaC) is a Hindi corpus made up of texts collected from the Internet. This corpus contains more than 100 million words crawled from the Hindi Internet during 2009, 2011 and 2014.

Texts in the corpus are lemmatized and morphologically tagged. The corpus has a word sketch grammar that enables users to explore the grammatical and collocational behavior of Hindi words. The whole process of corpus preparation is described in the Corpus factory method document (Kilgarriff et al. at LREC 2010).

The corpus contains a special attribute, cpos, which is a coarse POS tag that is not derived from the attribute tag.

Part-of-speech tagset

See the Hindi part-of-speech tagset describing POS tags used in the corpus.

Attributes only in the 3rd version of the corpus

  • hlemma/hword (heuristic) – attributes in which all vowels are stripped and only the consonants remain. Most spelling variation is due to the use of different vowels, so hlemma and hword come in handy for finding similarly spelt words, e.g. ka (क) + e -> ki की
  • tags with the suffix “:?” mark words that cannot be assigned to the target tag on linguistic grounds but had to be classified that way because of the context
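
The vowel-stripping idea behind hlemma and hword can be illustrated with a minimal Python sketch. This is not the heuristic actually used to build the corpus, which is not documented here. It assumes that "stripping vowels" means keeping only the basic Devanagari consonant letters (U+0915 to U+0939), and the function names are hypothetical.

```python
def consonant_skeleton(word: str) -> str:
    """Keep only basic Devanagari consonant letters (U+0915 to U+0939).

    Assumption: vowel signs, independent vowels, virama, nukta and
    anusvara are all dropped; the corpus heuristic may differ.
    """
    return "".join(ch for ch in word if "\u0915" <= ch <= "\u0939")


def group_by_skeleton(words):
    """Group words that share the same consonant skeleton."""
    groups = {}
    for word in words:
        groups.setdefault(consonant_skeleton(word), []).append(word)
    return groups


# Two spellings that differ only in vowel length fall into one group,
# as do two short words that differ only in their vowel sign.
variants = group_by_skeleton(["दुकान", "दूकान", "की", "कि"])
for skeleton, members in variants.items():
    print(skeleton, members)
```

Under these assumptions, spellings that differ only in vowels share a skeleton and can be retrieved together, while spellings that differ in consonants still produce different skeletons.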

Tools to work with the Hindi corpus

A complete set of tools is available to work with this HindiWaC corpus to generate:

  • word sketch – Hindi collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • keywords – terminology extraction of one-word and multi-word units
  • word lists – lists of Hindi nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
  • text type analysis – statistics of metadata in the corpus
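
As a rough illustration of what an n-gram frequency list contains, the following Python sketch counts word bigrams in a short tokenized sentence. It does not reflect how Sketch Engine computes n-grams; the whitespace tokenization, the sample sentence, and the function name are assumptions made for this example.

```python
from collections import Counter


def ngram_frequencies(tokens, n):
    """Count every run of n consecutive tokens."""
    # zip over n shifted views of the token list yields the n-grams.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams)


# Naive whitespace tokenization of an invented sample sentence.
tokens = "यह एक किताब है और यह एक कलम है".split()
bigrams = ngram_frequencies(tokens, 2)

# Print the bigrams from most to least frequent.
for gram, count in bigrams.most_common():
    print(" ".join(gram), count)
```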

v4.0 (10th Feb 2017)

  • added data from 2014 with the total size 107 million words
  • improved sketch grammar
  • removed special positional attributes: hlemma and hword

v3.0 (17th Jan 2012)

  • recollected in 2011, size 58 million tokens
  • tagged using the shallow tagging legend
  • afterward, retagged using a new POS tagger (91.31% accuracy) and lemmatized; lemmatizer and POS analyzer available at http://sivareddy.in/downloads
  • written a simple sketch grammar for Hindi and generated first word sketches for Hindi
  • in 2014, the sketch grammar was revised with new rules that make use of postposition markers (which are crucial in Hindi dependency parsing), and further rules were added (see the bibliography)
  • added lempos attribute
  • special positional attributes: hlemma, hword, and cpos

v1.0 (Dec 2009)

  • initial size 27 million words
  • created by Siva Reddy
  • no part-of-speech tagging

Eragani, A. K., Kuchibhotla, V., Sharma, D. M., Reddy, S., & Kilgarriff, A. (2014). Hindi Word Sketches. In Proceedings of the 11th International Conference on Natural Language Processing (ICON).
