thTenTen: Corpus of the Thai Web

The Thai web corpus (thTenTen) is a Thai corpus made up of texts collected from the Internet. The corpus belongs to the TenTen corpus family, a set of web corpora built using the same method with a target size of 10+ billion words. Sketch Engine currently provides access to TenTen corpora in more than 30 languages.

The Thai language, also called Ayutthaya or Siamese, is the official and national language of Thailand. This Thai corpus was crawled by SpiderLing in August and September 2018. Sources included the Thai web and Thai Wikipedia. Texts were tokenised by the SWATH (Smart Word Analysis for THai) segmenter (see http://www.cs.cmu.edu/~paisarn/software.html) and are not yet part-of-speech tagged.
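Written Thai does not put spaces between words, so a segmenter has to find word boundaries before any word-level processing can happen. The sketch below illustrates the general idea with a toy greedy longest-match segmenter over a small word list. It is not SWATH's actual implementation; the function name `longest_match_segment` and the three-word lexicon are hypothetical and chosen only for illustration.

```python
def longest_match_segment(text, lexicon):
    """Greedy longest-match segmentation of unspaced text.

    Illustrative only: a toy dictionary-based approach, not SWATH itself.
    """
    # Longest lexicon entry bounds how far ahead we need to look.
    max_len = max(map(len, lexicon), default=0)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a lexicon hit.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
        else:
            # No lexicon entry starts here: emit one character so the
            # loop always advances.
            tokens.append(text[i])
            i += 1
    return tokens


# Hypothetical mini-lexicon: "I", "eat", "rice".
words = ["ฉัน", "กิน", "ข้าว"]
text = "".join(words)  # Thai text is written without spaces
print(longest_match_segment(text, set(words)))
```

A greedy strategy like this can pick the wrong boundary when one word is a prefix of another, which is one reason real segmenters typically combine a dictionary with additional disambiguation.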

For detailed information about TenTen corpora, see Common TenTen corpora attributes.

Tools to work with the Thai corpus

A complete set of tools is available to work with this Thai corpus to generate:

  • word lists – lists of Thai words organized by frequency
  • n-grams – frequency lists of multi-word units
  • concordance – examples in context
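To make the three outputs above concrete, here is a minimal sketch of how a word list, an n-gram list and a concordance can be computed from text that has already been tokenised. It does not show how Sketch Engine computes them; the function names `word_list`, `ngrams` and `concordance` and the sample tokens are assumptions made for illustration.

```python
from collections import Counter


def word_list(tokens):
    """Return (token, frequency) pairs, most frequent first."""
    return Counter(tokens).most_common()


def ngrams(tokens, n):
    """Return (n-gram, frequency) pairs for contiguous runs of n tokens."""
    # zip over n shifted views of the token list yields every n-gram.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams).most_common()


def concordance(tokens, keyword, width=3):
    """Return keyword-in-context rows: (left context, keyword, right context)."""
    rows = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            rows.append((left, tok, right))
    return rows


# Hypothetical pre-tokenised sample: "I eat rice", "he eats rice".
tokens = ["ฉัน", "กิน", "ข้าว", "เขา", "กิน", "ข้าว"]

print(word_list(tokens))
print(ngrams(tokens, 2))
for left, kw, right in concordance(tokens, "ข้าว", width=2):
    print(" ".join(left), "[" + kw + "]", " ".join(right))
```

Because the corpus is tokenised but not part-of-speech tagged, all three functions work on surface tokens only.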

Changelog

Thai Web 2018 (thTenTen18)

  • crawled in August and September 2018, with an initial size of 695 million tokens
  • texts only tokenised

Bibliography

TenTen corpora

Jakubíček, M., Kilgarriff, A., Kovář, V., Rychlý, P., & Suchomel, V. (2013, July). The TenTen corpus family. In 7th International Corpus Linguistics Conference CL (pp. 125-127).

Suchomel, V., & Pomikálek, J. (2012). Efficient web crawling for large text corpora. In Proceedings of the seventh Web as Corpus Workshop (WAC7) (pp. 39-43).
