The Turkmen Web Corpus (tkWaC) is a Turkmen corpus made up of texts collected from the Internet. The corpus was prepared following the method described in the paper A Corpus Factory for Many Languages (Kilgarriff et al., LREC 2010). The data were downloaded in January 2012 and total 2 million words. The texts were cleaned and deduplicated. Turkmen is a Turkic language.
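To illustrate what deduplication means in practice, the following is a minimal sketch in Python. It is not the pipeline used to build tkWaC: it only removes exact duplicates after simple normalization, whereas web corpus pipelines typically also detect near-duplicates. The function names `normalize` and `deduplicate` and the sample strings are hypothetical and chosen for illustration.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate(documents):
    """Drop empty texts and exact duplicates (after normalization).

    The first occurrence of each text is kept, in its original form.
    This is an illustrative simplification, not the tkWaC procedure.
    """
    seen = set()
    unique = []
    for doc in documents:
        norm = normalize(doc)
        if not norm:
            continue  # skip texts that are empty after cleaning
        key = hashlib.sha1(norm.encode("utf-8")).hexdigest()
        if key in seen:
            continue  # an identical text was already kept
        seen.add(key)
        unique.append(doc)
    return unique


# Hypothetical sample input: two variants of the same text, one other text, one empty text.
sample = ["Salam  dünýä", "salam dünýä", "Türkmen dili", ""]
result = deduplicate(sample)
```

Hashing the normalized text keeps memory use small when many long documents are compared, since only the digests are stored.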
Tools to work with the Turkmen corpus
A complete set of tools is available for working with this Turkmen corpus. The tools can generate:
word lists – lists of Turkmen words organized by frequency