Turkic corpora from the Web
The Turkic Web Corpora are a set of corpora made up of texts collected from the Internet. They cover six Turkic languages, each in its own separate corpus.
For more information, visit the info page for the particular language.
Overview of the Turkic corpora
||DOCUMENTS (in thousands)
||Dec 2011, Jan 2012
The source texts were crawled by the SpiderLing web spider in December 2011 and January 2012. The crawl was restricted to the top-level Internet domains of the countries where the selected languages are officially spoken (.az, .kz, .kg, .tr, .tm, .uz), with several exceptions allowed.
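
The domain restriction described above can be illustrated with a short sketch. This is not SpiderLing's actual code. It is only a minimal Python example, assuming that a URL is accepted when its host name ends in one of the six listed top-level domains. The function name `is_allowed` and the `exceptions` parameter are hypothetical, and the example hosts are placeholders, since the real exceptions are not listed here.

```python
from urllib.parse import urlsplit

# Top-level domains named in the text above.
ALLOWED_TLDS = {"az", "kz", "kg", "tr", "tm", "uz"}


def is_allowed(url, exceptions=frozenset()):
    """Return True if the URL's host is within the crawl restriction.

    `exceptions` is a hypothetical set of extra host names that are
    accepted regardless of their top-level domain.
    """
    # hostname is None for strings that do not parse as a URL with a host.
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    if not host:
        return False
    if host in exceptions:
        return True
    # The top-level domain is the last dot-separated label of the host.
    return host.rsplit(".", 1)[-1] in ALLOWED_TLDS


if __name__ == "__main__":
    for url in ("http://example.com.tr/page", "https://example.org/"):
        print(url, is_allowed(url))
```

Matching on the last label of the host name keeps the check simple. A production crawler would probably also need to deal with redirects and with hosts outside these domains that serve text in the target languages.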