loTenTen: Corpus of the Lao Web

The Lao Web Corpus (loTenTen) is a Lao corpus made up of texts collected from the Internet. The corpus belongs to the TenTen corpus family, a set of web corpora built using the same method with a target size of 10+ billion words. Sketch Engine currently provides access to TenTen corpora in more than 30 languages.

The data were crawled by Spiderling in August and September 2018 and 2019 from the following sources: Lao Wikipedia and the Lao web. Texts were tokenised using our in-house segmenter and tagged using an in-house RFTagger model.

For detailed information about TenTen corpora, see Common TenTen corpora attributes.

Part-of-speech tagset

This Lao corpus was tagged using the PAN localization part-of-speech tags.

Tools to work with the Lao corpus

A complete set of tools is available to work with this Lao corpus to generate:

  • keywords – terminology extraction of one-word units
  • word lists – lists of Lao nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
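As a rough illustration of what two of these outputs contain, the sketch below builds an n-gram frequency list and a simple keyword-in-context concordance from a list of tokens. It is not Sketch Engine's implementation: the function names `ngram_frequencies` and `concordance` are hypothetical, and the placeholder tokens stand in for Lao text that is assumed to have already been split into words by a segmenter.

```python
from collections import Counter


def ngram_frequencies(tokens, n=2):
    """Count every contiguous sequence of n tokens."""
    # zip over n shifted copies of the token list yields the n-grams
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams)


def concordance(tokens, keyword, width=2):
    """Return (left context, keyword, right context) for each hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append((left, tok, right))
    return lines


# Placeholder tokens (illustrative only); real input would be
# segmented Lao words, since Lao is not whitespace-delimited.
tokens = ["a", "b", "c", "a", "b", "d"]

bigrams = ngram_frequencies(tokens, 2)
top = bigrams.most_common(1)            # most frequent bigram first
hits = concordance(tokens, "c", width=2)
```

Sorting the counter with `most_common` gives the frequency-ordered list, and each concordance tuple keeps the keyword separate from its context so that hits can be aligned in a column.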

1st version (October 2018)

  • crawled data totalling 17.4 million tokens
  • tokenised; not yet tagged or lemmatised

TenTen corpora

Jakubíček, M., Kilgarriff, A., Kovář, V., Rychlý, P., & Suchomel, V. (2013, July). The TenTen corpus family. In 7th International Corpus Linguistics Conference CL (pp. 125-127).

Suchomel, V., & Pomikálek, J. (2012). Efficient web crawling for large text corpora. In Proceedings of the seventh Web as Corpus Workshop (WAC7) (pp. 39-43).

Search the Lao corpus

Sketch Engine offers a range of tools to work with this Lao corpus from the web.


Other text corpora

Sketch Engine offers 450+ language corpora.

Use Sketch Engine in minutes

Generate collocations, frequency lists, examples in context, and n-grams, or extract terms. Use our Quick Start Guide to learn how in minutes.