yoWaC: Corpus of the Yoruba Web
The Yoruba Web corpus (YorubaWaC) is a Yoruba corpus made up of texts collected from the Internet. The corpus was prepared according to the standards described in the paper A Corpus Factory for Many Languages (Kilgarriff et al., LREC 2010).
Tools to work with the Yoruba corpus
A complete set of Sketch Engine tools is available to work with this Yoruba Web corpus.
version 2 (17 January 2012)
- corpus tagged using a new POS tagger (77.63% accuracy), lemmatizer and morphological analyser downloaded from http://sivareddy.in/downloads
Baroni, M., & Kilgarriff, A. (2006). Large linguistically-processed web corpora for multiple languages. In Proceedings of the Eleventh Conference of the European Chapter of the Association for Computational Linguistics: Posters & Demonstrations (pp. 87-90). Association for Computational Linguistics.
Reddy, S., & Sharoff, S. (2011, November). Cross language POS taggers (and other tools) for Indian languages: An experiment with Kannada using Telugu resources. In Proceedings of the Fifth International Workshop On Cross Lingual Information Access (pp. 11-19).
Search the Yoruba corpus
Sketch Engine offers a range of tools to work with the Yoruba corpus.
Use Sketch Engine in minutes
Generating collocations, frequency lists, examples in context and n-grams, or extracting terms, is easy with Sketch Engine. Use our Quick Start Guide to learn it in minutes.
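For readers unfamiliar with the terms, the following Python sketch gives a rough idea of what a frequency list and an n-gram list contain. It is only an illustration under simple assumptions (whitespace tokenization, a tiny made-up sample), not Sketch Engine's implementation or API; the names `ngrams`, `frequency_list` and `sample` are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous sequences of n tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def frequency_list(items):
    """Return (item, count) pairs, most frequent first."""
    counts = Counter(items)
    # Sort by descending count; ties are broken by the item itself.
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))


# Hypothetical toy sample of three Yoruba greetings; a real web corpus
# is far larger and would be tokenized more carefully.
sample = "ẹ kú àárọ̀ ẹ kú alẹ́ ẹ kú iṣẹ́"
tokens = sample.split()  # naive whitespace tokenization (an assumption)

word_freq = frequency_list(tokens)
bigram_freq = frequency_list(ngrams(tokens, 2))

for item, count in bigram_freq:
    print(" ".join(item), count)
```

In this toy sample the bigram made of the first two tokens is the most frequent one, which is the kind of pattern an n-gram list is meant to surface.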