loTenTen – Lao corpus from the web

loTenTen: Corpus of the Lao Web

The Lao Web Corpus (loTenTen) is a Lao corpus made up of texts collected from the Internet. The corpus belongs to the TenTen corpus family, a set of web corpora built using the same method with a target size of 10+ billion words. Sketch Engine currently provides access to TenTen corpora in more than 30 languages.

The data were crawled by SpiderLing in August and September of 2018 and 2019 from the following sources: Lao Wikipedia and the Lao web. Texts were tokenized using our in-house segmenter and tagged using the in-house RFTagger model.

For detailed information about TenTen corpora, see Common TenTen corpora attributes.

Part-of-speech tagset

This Lao corpus was tagged using the PAN localization part-of-speech tags.

loTenTen corpus in detail

Basic statistics of the Lao Web Corpus 2019.

	Frequency
Tokens	121,266,009
Words	105,018,584
Sentences	5,782,107
Web pages	1,307,516
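
The counts above can be turned into a few derived ratios. The following is a minimal sketch in Python. The function and key names are illustrative and not part of Sketch Engine, and reading the difference between tokens and words as non-word tokens (such as punctuation) is an assumption based on the usual distinction between the two counts.

```python
# Counts copied from the statistics table for the Lao Web Corpus 2019.
TOKENS = 121_266_009
WORDS = 105_018_584
SENTENCES = 5_782_107
WEB_PAGES = 1_307_516


def derived_ratios(tokens, words, sentences, pages):
    """Return simple figures derived from the raw corpus counts."""
    return {
        # Assumption: tokens that are not words (e.g. punctuation).
        "non_word_tokens": tokens - words,
        "word_share": words / tokens,
        "tokens_per_sentence": tokens / sentences,
        "sentences_per_page": sentences / pages,
        "tokens_per_page": tokens / pages,
    }


ratios = derived_ratios(TOKENS, WORDS, SENTENCES, WEB_PAGES)
for name, value in ratios.items():
    print(f"{name}: {value:,.2f}")
```

With these counts, the averages come out at roughly 21 tokens per sentence and about 93 tokens per web page.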

Tools to work with the Lao corpus

A complete set of tools is available for working with this Lao corpus and generating:

word sketch – Lao collocations categorized by grammatical relations
thesaurus – synonyms and similar words for every word
keywords – terminology extraction of one-word units
word lists – lists of Lao nouns, verbs, adjectives etc. organized by frequency
n-grams – frequency lists of multi-word units
concordance – examples in context
text type analysis – statistics of metadata in the corpus
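
To illustrate what a word list or an n-gram frequency list contains, here is a minimal Python sketch that counts contiguous token sequences in text that has already been tokenized. It is not Sketch Engine's implementation: the function name `ngram_frequencies` is hypothetical and the tokens are placeholders rather than corpus data. It assumes segmentation has already been done, which matters for Lao because the script does not put spaces between words.

```python
from collections import Counter


def ngram_frequencies(tokens, n):
    """Count every contiguous sequence of n tokens in a token list."""
    grams = (tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return Counter(grams)


# Placeholder tokens standing in for segmenter output (not real corpus data).
tokens = ["a", "b", "a", "b", "c"]

unigrams = ngram_frequencies(tokens, 1)  # comparable to a word list
bigrams = ngram_frequencies(tokens, 2)   # comparable to a 2-gram list

# Print the bigrams from most to least frequent.
for gram, count in bigrams.most_common():
    print(" ".join(gram), count)
```

Sorting the counts by frequency, as `most_common()` does here, gives the kind of ranked list that the word list and n-gram tools present.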

Changelog

Lao Web 2019 (loTenTen19)

6th version (July 2021)

semi-automatically revised attributes processed into standard attributes

4th version (June 2020)

corpus size 121 million tokens
tokenized by in-house segmenter
part-of-speech tagged by RFTagger model
revised attributes – semi-automatically corrected

Lao Web 2018 (loTenTen18)

1st version (October 2018)

crawled data amounting to 17.4 million tokens
tokenized, not tagged

Bibliography

TenTen corpora

Jakubíček, M., Kilgarriff, A., Kovář, V., Rychlý, P., & Suchomel, V. (2013, July). The TenTen corpus family. In 7th International Corpus Linguistics Conference CL (pp. 125-127).

Suchomel, V., & Pomikálek, J. (2012). Efficient web crawling for large text corpora. In Proceedings of the seventh Web as Corpus Workshop (WAC7) (pp. 39-43).

Processing Lao data

Baisa, V., Blahuš, M., Cukr, M., Herman, O., Jakubíček, M., Kovář, V., Měchura, M., Medveď, M., Rychlý, P., & Suchomel, V. (2019). Automating dictionary production: a Tagalog-English-Korean dictionary from scratch. In Proceedings of the 6th Biennial Conference on Electronic Lexicography.

Blahuš, M., Cukr, M., Herman, O., Jakubíček, M., Kovář, V., & Medveď, M. (2021). Semi-automatic building of large-scale digital dictionaries. In Proceedings of the 6th Biennial Conference on Electronic Lexicography.

Search the Lao corpus

Sketch Engine offers a range of tools to work with this Laotian corpus from the web.

Other text corpora

Sketch Engine offers 800+ language corpora.

Use Sketch Engine in minutes

Generate collocations, frequency lists, examples in contexts, n-grams or extract terms. Use our Quick Start Guide to learn it in minutes.
