loTenTen: Corpus of the Lao Web
The Lao Web Corpus (loTenTen) is a Lao corpus made up of texts collected from the Internet. The corpus belongs to the TenTen corpus family, a set of web corpora built using the same method with a target size of 10+ billion words. Sketch Engine currently provides access to TenTen corpora in more than 30 languages.
The data were crawled by Spiderling in August and September 2018 and 2019 from the following sources: Lao Wikipedia and the Lao web. Texts were tokenised using our in-house segmenter and tagged using an in-house RFTagger model.
For detailed information about TenTen corpora, see Common TenTen corpora attributes.
This Lao corpus was tagged using the PAN localization part-of-speech tags.