viTenTen: Corpus of the Vietnamese Web
The Vietnamese Web Corpus (viTenTen) is a Vietnamese corpus made up of texts collected from the Internet. The corpus belongs to the TenTen corpus family, a set of web corpora built using the same method with a target size of 10+ billion words. Sketch Engine currently provides access to TenTen corpora in more than 30 languages.
For detailed information about TenTen corpora, see Common TenTen corpora attributes.
Part-of-speech tagset
This Vietnamese corpus has not been part-of-speech tagged yet.
Tools to work with the Vietnamese corpus
A complete set of tools is available to work with this Vietnamese corpus to generate:
- word lists – lists of Vietnamese words organized by frequency
- n-grams – frequency lists of multi-word units
- concordance – examples in context
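To make the three outputs above concrete, here is a minimal Python sketch that computes a word list, n-grams and a keyword-in-context concordance on a toy sentence. It is an illustration only, not Sketch Engine's implementation: the function names (`tokenize`, `word_list`, `ngrams`, `concordance`) and the sample sentence are hypothetical, and the whitespace tokenization is a simplifying assumption, since Vietnamese words often consist of several space-separated syllables.

```python
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenization (assumption for illustration only);
    # this yields syllables rather than true Vietnamese words.
    return text.lower().split()


def word_list(tokens):
    # Tokens paired with their frequency, most frequent first.
    return Counter(tokens).most_common()


def ngrams(tokens, n):
    # Frequency list of contiguous n-token sequences.
    grams = (tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return Counter(grams).most_common()


def concordance(tokens, keyword, window=2):
    # Keyword-in-context lines: (left context, keyword, right context).
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines


# Hypothetical sample text, not taken from the corpus.
sample = "tôi học tiếng việt và tôi thích tiếng việt"
tokens = tokenize(sample)

if __name__ == "__main__":
    print(word_list(tokens)[:3])
    print(ngrams(tokens, 2)[:3])
    for left, kw, right in concordance(tokens, tokens[2]):
        print(f"{left:>12} [{kw}] {right}")
```

On a real corpus these lists are computed over billions of tokens with a proper index, so the sketch only shows the shape of each result.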
Changelog
Vietnamese Web 2017 (November 2018)
- crawled by SpiderLing in November and December 2017 and in January 2018
- not tagged yet
Bibliography
TenTen corpora
Jakubíček, M., Kilgarriff, A., Kovář, V., Rychlý, P., & Suchomel, V. (2013, July). The TenTen corpus family. In 7th International Corpus Linguistics Conference CL (pp. 125-127).
Suchomel, V., & Pomikálek, J. (2012). Efficient web crawling for large text corpora. In Proceedings of the seventh Web as Corpus Workshop (WAC7) (pp. 39-43).