loTenTen: Corpus of the Lao Web
The Lao Web Corpus (loTenTen) is a Lao corpus made up of texts collected from the Internet. The corpus belongs to the TenTen corpus family, a set of web corpora built using the same method with a target size of 10+ billion words. Sketch Engine currently provides access to TenTen corpora in more than 30 languages.
The data were crawled by Spiderling in August and September 2018 and 2019 from the following sources: Lao Wikipedia, Lao web. Texts were tokenized using our in-house segmenter and tagged using the in-house RFTagger model.
For detailed information about TenTen corpora, see Common TenTen corpora attributes.
This Lao corpus was tagged using the PAN localization part-of-speech tags.
loTenTen corpus in detail
Basic statistical information about the Lao Web Corpus 2019.
Tools to work with the Lao corpus
A complete set of tools is available to work with this Lao corpus to generate:
- word sketch – Lao collocations categorized by grammatical relations
- thesaurus – synonyms and similar words for every word
- keywords – terminology extraction of one-word units
- word lists – lists of Lao nouns, verbs, adjectives etc. organized by frequency
- n-grams – frequency list of multi-word units
- concordance – examples in context
- text type analysis – statistics of metadata in the corpus
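As a rough illustration of what an n-gram frequency list contains, the following minimal Python sketch counts contiguous token sequences in text that has already been segmented. It is not Sketch Engine's implementation; the function name `ngram_frequencies` and the placeholder tokens are assumptions made for this example, and real Lao text would first need to pass through a word segmenter.

```python
from collections import Counter


def ngram_frequencies(tokens, n):
    """Count every contiguous sequence of n tokens (hypothetical helper)."""
    # Zip n staggered views of the token list to form the n-grams.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams)


# Placeholder tokens standing in for the output of a word segmenter.
tokens = ["a", "b", "a", "b", "c"]

bigrams = ngram_frequencies(tokens, 2)
unigrams = ngram_frequencies(tokens, 1)

# List n-grams from most to least frequent, as in a frequency list.
for gram, count in bigrams.most_common():
    print(" ".join(gram), count)
```

With `n=1` the same helper yields a plain word frequency list, which is the simplest form of the word lists mentioned above.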
Lao Web 2019 (loTenTen19)
6th version (July 2021)
- semi-automatically revised attributes processed into standard attributes
4th version (June 2020)
- corpus size 121 million tokens
- tokenized by in-house segmenter
- part-of-speech tagged by RFTagger model
- revised attributes – semi-automatically corrected
Lao Web 2018 (loTenTen18)
1st version (October 2018)
- crawled data in the size of 17.4 million tokens
- tokenized, not tagged
Jakubíček, M., Kilgarriff, A., Kovář, V., Rychlý, P., & Suchomel, V. (2013, July). The TenTen corpus family. In 7th International Corpus Linguistics Conference CL (pp. 125-127).
Suchomel, V., & Pomikálek, J. (2012). Efficient web crawling for large text corpora. In Proceedings of the seventh Web as Corpus Workshop (WAC7) (pp. 39-43).
Processing Lao data
Baisa, V., Blahuš, M., Cukr, M., Herman, O., Jakubíček, M., Kovář, V., Měchura, M., Medveď, M., Rychlý, P., & Suchomel, V. Automating dictionary production: a Tagalog-English-Korean dictionary from scratch. Proceedings of the 6th Biennial Conference on Electronic Lexicography, 2019.
Blahuš, M., Cukr, M., Herman, O., Jakubíček, M., Kovář, V., & Medveď, M. Semi-automatic building of large-scale digital dictionaries. Proceedings of the 6th Biennial Conference on Electronic Lexicography, 2021.
Search the Lao corpus
Sketch Engine offers a range of tools to work with this Lao corpus from the web.