loTenTen: Corpus of the Lao Web

The Lao Web Corpus (loTenTen) is a Lao corpus made up of texts collected from the Internet. The corpus belongs to the TenTen corpus family, a set of web corpora built using the same method with a target size of 10+ billion words. Sketch Engine currently provides access to TenTen corpora in more than 30 languages. The data were crawled by Spiderling in August and September 2018 from the following sources: Lao Wikipedia and the Lao web. Texts are tokenised using Polyglot.

For detailed information about TenTen corpora, see Common TenTen corpora attributes.

Part-of-speech tagset

The Lao corpus has not been tagged yet.

Tools to work with the Lao corpus

A complete set of tools is available to work with this Lao corpus to generate:

  • keywords – terminology extraction of one-word units
  • word lists – lists of Lao words organized by frequency (filtering by nouns, verbs, adjectives etc. requires part-of-speech tags, which this corpus does not have yet)
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
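As a rough illustration of what the n-gram and concordance tools in the list above compute, here is a minimal Python sketch. It is not Sketch Engine's implementation: the function names `ngram_frequencies` and `concordance` are hypothetical, and the token list is a placeholder standing in for tokenised Lao text rather than real corpus data.

```python
from collections import Counter


def ngram_frequencies(tokens, n=2):
    """Count every run of n consecutive tokens (hypothetical helper)."""
    # zip over n shifted copies of the list yields each n-gram as a tuple.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams)


def concordance(tokens, node, window=2):
    """Return (left context, node, right context) for each hit of `node`."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            hits.append((left, tok, right))
    return hits


# Placeholder tokens standing in for tokenised text (not real corpus data).
tokens = ["a", "b", "c", "a", "b", "d"]

bigrams = ngram_frequencies(tokens, n=2)
top = bigrams.most_common(1)          # most frequent bigram with its count
lines = concordance(tokens, "c", window=2)
```

Because the corpus is tokenised but not lemmatised, a sketch like this can only match exact word forms; grouping inflected or variant forms would need additional processing.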

1st version (October 2018)

  • crawled data totalling 17.4 million tokens
  • tokenised; not yet tagged or lemmatized

TenTen corpora

Jakubíček, M., Kilgarriff, A., Kovář, V., Rychlý, P., & Suchomel, V. (2013, July). The TenTen corpus family. In 7th International Corpus Linguistics Conference CL (pp. 125-127).

Suchomel, V., & Pomikálek, J. (2012). Efficient web crawling for large text corpora. In Proceedings of the seventh Web as Corpus Workshop (WAC7) (pp. 39-43).

Search the Lao corpus

Sketch Engine offers a range of tools to work with this Lao corpus from the web.

Other text corpora

Sketch Engine offers 450+ language corpora.

Use Sketch Engine in minutes

Generate collocations, frequency lists, examples in context and n-grams, or extract terms. Use our Quick Start Guide to learn how in minutes.