Yoruba corpus (yoWaC) | Sketch Engine

yoWaC: Corpus of the Yoruba Web

The Yoruba Web corpus (YorubaWaC) is a Yoruba corpus made up of texts collected from the Internet. The corpus was prepared according to the standards described in A Corpus Factory for Many Languages (Kilgarriff et al., LREC 2010).

The data was crawled with the SpiderLing web spider and the WebBootCat tool in 2012. The final corpus contains 2.8 million words.

Tools to work with the Yoruba corpus

A complete set of Sketch Engine tools is available to work with this Yoruba Web corpus to generate:

word lists – lists of Yoruba nouns, verbs, adjectives etc. organized by frequency
n-grams – frequency lists of multi-word units
concordance – examples in context
keywords – terminology extraction of one-word units
text type analysis – statistics of metadata in the corpus
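
As a rough illustration of what the first three outputs in this list contain, the minimal sketch below builds a word list, an n-gram frequency list and a keyword-in-context concordance from a list of tokens. It is not Sketch Engine's implementation or API. The function names (`word_list`, `ngram_list`, `concordance`) and the sample tokens are hypothetical, and the tokens are placeholders rather than text from the corpus.

```python
from collections import Counter


def word_list(tokens):
    """Return (word, frequency) pairs, most frequent first."""
    return Counter(tokens).most_common()


def ngram_list(tokens, n):
    """Return (n-gram, frequency) pairs for contiguous runs of n tokens."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams).most_common()


def concordance(tokens, keyword, width=2):
    """Return keyword-in-context lines with up to `width` tokens per side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


# Placeholder tokens standing in for Yoruba word forms; not taken from yoWaC.
SAMPLE = "a b c a b d a b c".split()

if __name__ == "__main__":
    print(word_list(SAMPLE))
    print(ngram_list(SAMPLE, 2))
    print("\n".join(concordance(SAMPLE, "c")))
```

A real corpus would first need tokenization (and, for lists of nouns or verbs, part-of-speech tags), which this sketch assumes has already been done.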

Changelog

version 2 (17 January 2012)

The corpus was tagged using a new POS tagger (77.63% accuracy), lemmatizer and morphological analyser downloaded from http://sivareddy.in/downloads

Bibliography

Sketch Engine general reference

Adam Kilgarriff, Vít Baisa, Jan Bušta, Miloš Jakubíček, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý, Vít Suchomel. The Sketch Engine: ten years on. Lexicography, 1: 7-36, 2014.

@article{kilgarriff2014sketch,
  title={The Sketch Engine: ten years on},
  author={Kilgarriff, Adam and Baisa, Vít and Bušta, Jan and Jakubíček, Miloš and Kovář, Vojtěch and Michelfeit, Jan and Rychlý, Pavel and Suchomel, Vít},
  journal={Lexicography},
  year={2014},
  volume={1},
  pages={7--36},
  publisher={Springer}
}

WaC corpora

Adam Kilgarriff, Siva Reddy, Jan Pomikálek, Avinesh PVS. A Corpus Factory for Many Languages. Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), 1: 904-910, 2010.

@inproceedings{kilgarriff2010corpus,
  title={A Corpus Factory for Many Languages},
  author={Kilgarriff, Adam and Reddy, Siva and Pomikálek, Jan and PVS, Avinesh},
  booktitle={Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)},
  year={2010},
  volume={1},
  pages={904--910},
  publisher={European Language Resources Association (ELRA)}
}

Adam Kilgarriff, Marco Baroni. Large linguistically-processed web corpora for multiple languages. Proceedings of the Eleventh Conference of the European Chapter of the Association for Computational Linguistics: Posters & Demonstrations, 1: 87-90, 2006.

@inproceedings{kilgarriff2006large,
  title={Large linguistically-processed web corpora for multiple languages},
  author={Kilgarriff, Adam and Baroni, Marco},
  booktitle={Proceedings of the Eleventh Conference of the European Chapter of the Association for Computational Linguistics: Posters \& Demonstrations},
  year={2006},
  volume={1},
  pages={87--90},
  publisher={Association for Computational Linguistics}
}

Search the Yoruba corpus

Sketch Engine offers a range of tools to work with the Yoruba corpus.

Other text corpora

Sketch Engine offers 800+ language corpora.

Use Sketch Engine in minutes

Generating collocations, frequency lists, examples in context and n-grams, or extracting terms, is easy with Sketch Engine. Use our Quick Start Guide to learn it in minutes.
