Turkic corpora from the Web
The Turkic Web Corpora are a set of corpora made up of texts collected from the Internet. They cover six Turkic languages, each in a separate corpus:
|Azerbaijani corpus||Kazakh corpus||Kyrgyz corpus|
|Turkmen corpus||Turkish corpus||Uzbek corpus|
For more information, visit the info page for the particular language.
Overview of the Turkic corpora
|LANGUAGE||WORDS||DOCUMENTS||DATA UPDATES|
|AZERBAIJANI||94 million||365 thousand||Jan 2012|
|KAZAKH||139 million||378 thousand||Jan 2012|
|KYRGYZ||19 million||67 thousand||Jan 2012|
|TURKISH||3.38 billion||12 million||Dec 2011, Jan 2012|
|TURKMEN||2 million||5 thousand||Jan 2012|
|UZBEK||18 million||57 thousand||Jan 2012|
The source texts were crawled by the SpiderLing web spider in December 2011 and January 2012. Crawling was restricted to the top-level Internet domains of the countries where the selected languages are officially spoken (.az, .kz, .kg, .tr, .tm, .uz), with several exceptions allowed.
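The domain restriction described above can be illustrated with a short Python sketch. This is not SpiderLing's actual code: the function name `language_for_url` and the `extra_hosts` allow-list (standing in for the "several exceptions", which the text does not specify) are assumptions made for illustration only.

```python
from urllib.parse import urlsplit

# Top-level domains listed in the text, mapped to the corpus language.
TLD_TO_LANGUAGE = {
    "az": "Azerbaijani",
    "kz": "Kazakh",
    "kg": "Kyrgyz",
    "tr": "Turkish",
    "tm": "Turkmen",
    "uz": "Uzbek",
}


def language_for_url(url, extra_hosts=None):
    """Return the corpus language a URL would fall under, or None if out of scope.

    extra_hosts is a hypothetical mapping (host -> language) that models
    the unspecified exceptions to the top-level domain rule.
    """
    # hostname is None for URLs without a scheme/netloc; treat that as out of scope.
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    if extra_hosts and host in extra_hosts:
        return extra_hosts[host]
    # The part after the last dot is the top-level domain.
    tld = host.rpartition(".")[2]
    return TLD_TO_LANGUAGE.get(tld)
```

A crawler built along these lines would call `language_for_url` on each discovered link and skip any link for which it returns `None`. URLs must include a scheme (for example `http://`), otherwise no host can be extracted and the link is treated as out of scope.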
Tools to work with the Turkic corpora
A complete set of Sketch Engine tools is available for working with these Turkic web corpora.
September 23, 2012
- The Turkic part crawled from the Turkish domain .tr was renamed to trTenTen 
March 6, 2012 (initial version)
- initial version, 6 languages
- no tagging, no sketches
Turkic Web corpora
Vít Baisa and Vít Suchomel (2012). Large Corpora for Turkic Languages and Unsupervised Morphological Analysis. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), Turkey, May 2012, pp. 28–32.
Use Sketch Engine in minutes
Generate collocations, frequency lists, examples in context, and n-grams, or extract terms. Use our Quick Start Guide to get started in minutes.