tgWaC: Corpus of Tajik Web
The Tajik Web Corpus (tgWaC) is a language corpus made up of texts collected from the Internet. The corpus was prepared according to the standards described in the paper A Corpus Factory for Many Languages (Kilgarriff et al., LREC 2010). The data were crawled by the SpiderLing web spider in 2011–2013 and comprise more than 93 million words with part-of-speech tagging.
The authors of this corpus are Vít Suchomel and Pavel Šmerk.
Each POS tag is built from the lemma of the given word and a number identifying one of 16 POS categories; see the part-of-speech tagset legend.
Tools to work with the Tajik corpus
A complete set of Sketch Engine tools is available to work with this Tajik Web corpus.
Corpus history (most recent first):
- corpus extended – 93 million words
- corpus tagged – each tag consists of the lemma and the POS
- corpus created – ca. 50 million words
DOVUDOV, Gulshan, Vít SUCHOMEL and Pavel ŠMERK. POS Annotated 50M Corpus of Tajik Language. In Proceedings of the Workshop on Language Technology for Normalisation of Less-Resourced Languages (SALTMIL 8/AfLaT 2012). Istanbul: European Language Resources Association (ELRA), 2012, pp. 93–98. ISBN 978-2-9517408-7-7.
DOVUDOV, Gulshan, Vít SUCHOMEL and Pavel ŠMERK. Towards 100M Morphologically Annotated Corpus of Tajik. In Aleš Horák, Pavel Rychlý (eds.). Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2012. Brno: Tribun EU, 2012, pp. 91–94. ISBN 978-80-263-0313-8.
Search the Tajik Web corpus
Sketch Engine offers a range of tools to work with the Tajik corpus from the Web.
Use Sketch Engine in minutes
Generating collocations, frequency lists, examples in contexts, n-grams or extracting terms is easy with Sketch Engine. Use our Quick Start Guide to learn it in minutes.
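For readers unfamiliar with the terms, the following minimal Python sketch illustrates what a frequency list and n-grams are. It does not use Sketch Engine or its API; the function names `frequency_list` and `ngrams` are hypothetical, and the token sample is a made-up toy sentence, not data from tgWaC.

```python
from collections import Counter


def frequency_list(tokens):
    """Return (token, count) pairs, most frequent first."""
    return Counter(tokens).most_common()


def ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Toy sample for illustration only; not taken from tgWaC.
tokens = "ман китоб мехонам ва ман китоб менависам".split()

# Count how often each pair of adjacent tokens occurs.
bigram_counts = Counter(ngrams(tokens, 2))

if __name__ == "__main__":
    for token, count in frequency_list(tokens):
        print(token, count)
    for bigram, count in bigram_counts.most_common():
        print(" ".join(bigram), count)
```

On a real corpus the same counts would be computed over tokenized and lemmatized text, which is the kind of work the Sketch Engine tools do at scale.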