thTenTen: Corpus of the Thai Web
The Thai web corpus (thTenTen) is a Thai corpus made up of texts collected from the Internet. The corpus belongs to the TenTen corpus family, a set of web corpora built using the same method with a target size of 10+ billion words. Sketch Engine currently provides access to TenTen corpora in more than 30 languages.
The Thai language, also known as Ayutthaya or Siamese, is the official and national language of Thailand. This Thai corpus was crawled by SpiderLing in August and September 2018. Sources included the Thai web and Thai Wikipedia. The texts were tokenised by the SWATH (Smart Word Analysis for THai) segmenter (see http://www.cs.cmu.edu/~paisarn/software.html) and have not yet been part-of-speech tagged.
For detailed information about TenTen corpora, see Common TenTen corpora attributes.