orWaC – Oromo corpus from the web

The Oromo web corpus (orWaC) is an Oromo corpus made up of texts collected from the Internet. The corpus was prepared following the method described in "A Corpus Factory for Many Languages" (Kilgarriff et al., LREC 2010).

The data was crawled by the SpiderLing web spider in January 2016 and comprises 4 million words.

Document count – the most frequent web domains and domain size distribution:

Top-level domains (documents):
org	5,676
com	2,054
net	839
et	213

Web domains (documents):
*.jw.org	2,695
qeerroo.org	1,010
vidoser.org	632
gadaa.net|com	518
*.voaafaanoromoo.com	438
oromedia.net	304
bilisummaa.com	291
*.blogspot.com	287
oromiatimes.org	276

Second-level domain size distribution (number of domains):
At least 1000 documents	2
At least 500 documents	4
At least 100 documents	16
At least 50 documents	21
At least 10 documents	45
At least 5 documents	60
At least 1 document	190
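
A distribution of this "at least N documents" kind can be derived from a list of crawled document URLs. The following is a minimal sketch in Python, not the procedure actually used to build orWaC. The function names are hypothetical, and taking the last two host labels as the second-level domain is a simplifying assumption that would misgroup hosts under multi-part suffixes such as co.uk.

```python
from collections import Counter
from urllib.parse import urlsplit


def second_level_domain(url):
    """Return the last two labels of the host name, e.g. 'jw.org'.

    Simplifying assumption: the registrable domain is always the last
    two labels, which does not hold for suffixes such as 'co.uk'.
    """
    host = urlsplit(url).hostname or ""
    parts = host.split(".")
    return ".".join(parts[-2:]) if len(parts) >= 2 else host


def domain_size_distribution(urls, thresholds=(1000, 500, 100, 50, 10, 5, 1)):
    """Count how many second-level domains have at least N documents."""
    # One URL is treated as one document.
    docs_per_domain = Counter(second_level_domain(u) for u in urls)
    return {
        t: sum(1 for n in docs_per_domain.values() if n >= t)
        for t in thresholds
    }
```

Because the thresholds are cumulative, a domain with 1,010 documents is counted in every row from "at least 1000" down to "at least 1", which is why the counts in the table grow as the threshold falls.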

News, politics, and religious sites account for a significant share of the corpus sources.

The corpus was created within the HaBiT project (Harvesting big text data for under-resourced languages). For more information, see https://habit-project.eu/wiki/OromoCorpus

Part-of-speech tagset

The orWaC corpus contains part-of-speech (POS) annotation based on Universal Dependencies, an annotation scheme designed to support multilingual parser development.

Tools to work with the Oromo corpus

A complete set of Sketch Engine tools is available for working with this Oromo web corpus. They can generate:

word sketch – Oromo collocations categorized by grammatical relations
thesaurus – synonyms and similar words for every word
keywords – terminology extraction of one-word units
word lists – lists of Oromo nouns, verbs, adjectives, etc. organized by frequency
n-grams – frequency list of multi-word units
concordance – examples in context
text type analysis – statistics of metadata in the corpus
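
To make the word list and n-gram outputs above more concrete, here is a minimal Python sketch of a frequency list over n-grams. It is an illustration only and not how Sketch Engine computes its lists. The function name is hypothetical, and the input is assumed to be an already tokenized text.

```python
from collections import Counter


def ngram_frequencies(tokens, n=2):
    """Count contiguous n-grams in a list of tokens.

    With n=1 this reduces to a plain word frequency list.
    """
    # Zip n shifted views of the token list: tokens[0:], tokens[1:], ...
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


# Placeholder tokens; a real run would use tokenized corpus text.
tokens = "a b a b c".split()
bigrams = ngram_frequencies(tokens, n=2)
words = ngram_frequencies(tokens, n=1)
```

Calling `bigrams.most_common()` or `words.most_common()` returns the entries sorted by descending frequency, which is the usual presentation of such lists.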

Bibliography

Corpus factory method

Kilgarriff, A., Reddy, S., Pomikálek, J., & Avinesh, P. V. S. (2010, May). A corpus factory for many languages. In LREC.
