bnWaC: Bengali corpus from the web

The Bangla Web Corpus (bnWaC) is a Bengali corpus made up of texts collected from the Internet. The corpus was prepared using the Corpus Factory method, an approach to building large general-language corpora that can be applied to many languages (Kilgarriff et al., 2010). The corpus texts are available with lemmatization and part-of-speech (POS) tagging.

Part-of-speech tagset

This Bangla corpus uses the part-of-speech tagset for Bengali created with an annotation tool developed at Microsoft Research India.

Tools to work with the Bengali corpus

A complete set of Sketch Engine tools is available to work with this Bengali web corpus to generate:

  • word lists – lists of Bengali nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
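The three outputs listed above can be illustrated with a minimal Python sketch. This is not how Sketch Engine computes them; the function names (`word_list`, `ngrams`, `concordance`) and the toy token list are assumptions made for illustration, and the sample sentence is not taken from bnWaC.

```python
from collections import Counter

def word_list(tokens):
    """Return (word, count) pairs sorted by descending frequency."""
    return Counter(tokens).most_common()

def ngrams(tokens, n):
    """Count every contiguous run of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def concordance(tokens, keyword, width=2):
    """Return (left context, keyword, right context) for each hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

# Hypothetical, pre-tokenized sample (whitespace tokenization only).
tokens = "আমি ভাত খাই তুমি ভাত খাও".split()

print(word_list(tokens))           # words ordered by frequency
print(ngrams(tokens, 2))           # bigram counts
print(concordance(tokens, "ভাত"))  # each occurrence with its context
```

A real corpus tool would work on lemmas and POS tags rather than raw whitespace tokens, which is why the corpus ships with both annotations.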

version 2 (April 2017)

  • applied new tokenizer and lemmatizer
  • corpus tagged using an annotation tool developed at Microsoft Research India


  • shallow tagging (no part-of-speech classification)
  • universal word sketch grammar

version 1 (2010)

  • initial version

Indian Language Part-of-Speech Tagset: Bengali

Bali, Kalika, Monojit Choudhury, and Priyanka Biswas. Indian Language Part-of-Speech Tagset: Bengali LDC2010T16. Web Download. Philadelphia: Linguistic Data Consortium, 2010.

Web corpora

Baroni, M., & Kilgarriff, A. Large linguistically-processed web corpora for multiple languages. In Proceedings of the Eleventh Conference of the European Chapter of the Association for Computational Linguistics: Posters & Demonstrations, pp. 87–90. Association for Computational Linguistics, 2006.

Kilgarriff, A., Reddy, S., Pomikálek, J., & Avinesh, P. V. S. A corpus factory for many languages. In LREC, May 2010.
