bnWaC: Bengali corpus from the web
The Bangla Web Corpus (bnWaC) is a Bengali corpus made up of texts collected from the Internet. The corpus was prepared using the Corpus Factory method, an approach to building large general-language corpora that can be applied to many languages (Kilgarriff et al. 2010). The corpus texts are lemmatized and part-of-speech (POS) tagged.
Part-of-speech tagset
This Bangla corpus uses the part-of-speech tagset for Bengali, created with an annotation tool developed at Microsoft Research India.
Tools to work with the Bengali corpus
A complete set of Sketch Engine tools is available to work with this Bengali web corpus to generate:
- word lists – lists of Bengali nouns, verbs, adjectives etc. organized by frequency
- n-grams – frequency list of multi-word units
- concordance – examples in context
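The three outputs listed above can be illustrated with a small sketch. This is not how Sketch Engine computes them; it is a toy Python example that assumes a whitespace-tokenized sample sentence, and the function names (`word_list`, `ngrams`, `concordance`) are hypothetical helpers introduced only for illustration.

```python
from collections import Counter


def word_list(tokens):
    """Return (word, count) pairs sorted by descending frequency."""
    return Counter(tokens).most_common()


def ngrams(tokens, n):
    """Count every contiguous sequence of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def concordance(tokens, keyword, width=2):
    """Return (left context, keyword, right context) for each occurrence."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Toy sample, split on whitespace (an assumption; real corpora need a proper tokenizer).
tokens = "আমি ভাত খাই তুমি ভাত খাও".split()

frequencies = word_list(tokens)        # word list ordered by frequency
bigrams = ngrams(tokens, 2)            # frequency list of two-word units
hits = concordance(tokens, tokens[1])  # every occurrence of the second token, in context
```

A real corpus query would also use the lemma and POS annotations, for example to restrict a word list to nouns only; the sketch works on raw word forms.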
Changelog
version 2 (April 2017)
- applied new tokenizer and lemmatizer
- corpus tagged using an annotation tool developed at Microsoft Research India
(2012)
- shallow tagging (no part-of-speech classification)
- universal word sketch grammar
version 1 (2010)
- initial version
Bibliography
Indian Language Part-of-Speech Tagset: Bengali
Bali, Kalika, Monojit Choudhury, and Priyanka Biswas. Indian Language Part-of-Speech Tagset: Bengali LDC2010T16. Web Download. Philadelphia: Linguistic Data Consortium, 2010.
Web corpora
Baroni, Marco, and Adam Kilgarriff. Large linguistically-processed web corpora for multiple languages. In: Proceedings of the Eleventh Conference of the European Chapter of the Association for Computational Linguistics: Posters & Demonstrations. Association for Computational Linguistics, 2006, pp. 87–90.
Kilgarriff, Adam, Siva Reddy, Jan Pomikálek, and P. V. S. Avinesh. A corpus factory for many languages. In: LREC, May 2010.