Turkic corpora from the web

The Turkic Web Corpora are a set of corpora made up of texts collected from the Internet. They cover six Turkic languages, each in its own separate corpus:

Azerbaijani corpus	Kazakh corpus	Kyrgyz corpus
Turkmen corpus	Turkish corpus	Uzbek corpus

For more information, visit the info page for the particular language.

Overview of the Turkic corpora

LANGUAGE	WORDS	DOCUMENTS	DATA UPDATES
AZERBAIJANI	94 million	365 thousand	Jan 2012
KAZAKH	139 million	378 thousand	Jan 2012
KYRGYZ	19 million	67 thousand	Jan 2012
TURKISH	3.38 billion	12 million	Dec 2011, Jan 2012
TURKMEN	2 million	5 thousand	Jan 2012
UZBEK	18 million	57 thousand	Jan 2012

Source data

The source texts were crawled by the SpiderLing web spider in December 2011 and January 2012. The crawling was restricted to the top-level Internet domains of the countries where the selected languages are official languages (.az, .kz, .kg, .tr, .tm, .uz), although several exceptions were allowed.
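
The domain restriction described above can be illustrated with a short Python sketch. This is not SpiderLing's actual code: the function name `in_allowed_tld` and the sample URLs are hypothetical, and the exceptions mentioned above are not modelled.

```python
from urllib.parse import urlsplit

# Top-level domains listed above for the six languages.
ALLOWED_TLDS = {"az", "kz", "kg", "tr", "tm", "uz"}

def in_allowed_tld(url: str) -> bool:
    """Return True if the URL's host ends in one of the allowed TLDs."""
    host = urlsplit(url).hostname
    if not host:
        return False
    # The last dot-separated label of the host name is the TLD.
    return host.rstrip(".").rsplit(".", 1)[-1].lower() in ALLOWED_TLDS

# Hypothetical sample URLs, used only for illustration.
urls = [
    "http://example.kz/news",
    "https://example.com/",
    "http://sub.example.tr/a",
]
kept = [u for u in urls if in_allowed_tld(u)]
```

Here `kept` retains only the `.kz` and `.tr` URLs. A real crawler would also need a whitelist for the exceptions, which the source does not specify.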

Tools to work with the Turkic corpora

A complete set of Sketch Engine tools is available to work with these Turkic web corpora to generate:

word sketch – collocations categorized by grammatical relations
thesaurus – synonyms and similar words for every word
word lists – lists of nouns, verbs, adjectives etc. organized by frequency
n-grams – frequency list of multi-word units
concordance – examples in context
keywords – terminology extraction of one-word units
text type analysis – statistics of metadata in the corpus
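
As a rough illustration of what an n-gram frequency list contains (not of how Sketch Engine computes it), the following Python sketch counts runs of consecutive tokens. The function name `ngram_frequencies` and the toy token sequence are assumptions made for this example only.

```python
from collections import Counter

def ngram_frequencies(tokens, n):
    """Count every run of n consecutive tokens."""
    # zip over n shifted copies of the token list yields the n-grams.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams)

# Toy token sequence, not taken from any of the corpora.
tokens = "a b a b c".split()
bigrams = ngram_frequencies(tokens, 2)
top = bigrams.most_common(1)
```

In this toy input the bigram `('a', 'b')` occurs twice and every other bigram once, so it heads the frequency list.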

Changelog

September 23, 2012

The part of the Turkic corpora crawled from the Turkish domain .tr was renamed to trTenTen [2012].

initial version (March 6, 2012)

initial version, 6 languages
no tagging, no sketches

Bibliography

Turkic Web corpora

Vít Baisa and Vít Suchomel (2012). Large Corpora for Turkic Languages and Unsupervised Morphological Analysis. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), Turkey, May 2012, pp. 28–32.
