Turkic corpora from the Web

The Turkic Web Corpora are a set of corpora made up of texts collected from the Internet. They cover six Turkic languages, each in a separate corpus:

  • Azerbaijani corpus
  • Kazakh corpus
  • Kyrgyz corpus
  • Turkmen corpus
  • Turkish corpus
  • Uzbek corpus

For more information, visit the info page for the particular language.

Overview of the Turkic corpora

LANGUAGE       WORDS           DOCUMENTS       DATA UPDATES
AZERBAIJANI    94 million      365 thousand    Jan 2012
KAZAKH         139 million     378 thousand    Jan 2012
KYRGYZ         19 million      67 thousand     Jan 2012
TURKISH        3.38 billion    12 million      Dec 2011, Jan 2012
TURKMEN        2 million       5 thousand      Jan 2012
UZBEK          18 million      57 thousand     Jan 2012

Source data

The source texts were crawled by the SpiderLing web spider in December 2011 and January 2012. Crawling was restricted to the top-level internet domains of the countries where the selected languages are officially spoken (.az, .kz, .kg, .tr, .tm, .uz), although several exceptions were allowed.
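The domain restriction described above can be illustrated with a minimal sketch. This is not SpiderLing's code; the function name `in_allowed_tld` is hypothetical, and the exceptions mentioned in the text are not specified, so they are not modelled here.

```python
from urllib.parse import urlsplit

# Top-level domains listed in the text above.
ALLOWED_TLDS = {"az", "kz", "kg", "tr", "tm", "uz"}


def in_allowed_tld(url, allowed=ALLOWED_TLDS):
    """Return True if the URL's host ends in one of the allowed TLDs.

    Hypothetical helper for illustration only.
    """
    host = urlsplit(url).hostname  # None when the string has no host part
    if not host:
        return False
    # Take the last dot-separated label of the host name.
    tld = host.rstrip(".").rsplit(".", 1)[-1].lower()
    return tld in allowed


if __name__ == "__main__":
    for url in ("http://example.kz/page", "http://example.com/"):
        print(url, in_allowed_tld(url))
```

A real crawler would also need a way to admit the exceptions, for example a separate list of permitted hosts outside these domains.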

Tools to work with the Turkic corpora

A complete set of Sketch Engine tools is available to work with these Turkic web corpora to generate:

  • word sketch – collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • word lists – lists of nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
  • keywords – terminology extraction of one-word units
  • text type analysis – statistics of metadata in the corpus
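To make the n-gram item in the list above concrete, here is a rough sketch of how a frequency list of multi-word units can be computed. It is not how Sketch Engine implements the feature; the function name `ngram_frequencies`, the whitespace tokenization, and the sample tokens are all assumptions made for illustration.

```python
from collections import Counter


def ngram_frequencies(tokens, n=2):
    """Count every contiguous sequence of n tokens in a token list."""
    # zip over n shifted views of the list yields each n-gram as a tuple.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams)


# Toy input split on whitespace; a real corpus would need proper tokenization.
tokens = "bir iki bir iki on".split()
freq = ngram_frequencies(tokens, 2)

if __name__ == "__main__":
    # Print the n-grams from most to least frequent.
    for gram, count in freq.most_common():
        print(" ".join(gram), count)
```

Setting `n=1` gives a plain word frequency list, which corresponds loosely to the word lists item, though without the part-of-speech grouping.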

September 23, 2012

  • The Turkic part crawled from the Turkish domain .tr was renamed to trTenTen [2012]

Initial version (March 6, 2012)

  • initial version, 6 languages
  • no tagging, no sketches

Turkic Web corpora

Vít Baisa and Vít Suchomel (2012). Large Corpora for Turkic Languages and Unsupervised Morphological Analysis. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), Turkey, May 2012, pp. 28–32.
