soWaC – Somali corpus from the web

The Somali web corpus (soWaC) is a Somali corpus made up of texts collected from the Internet. It was prepared according to the method described in A Corpus Factory for Many Languages (Kilgarriff et al., LREC 2010).

The data was crawled by the SpiderLing web spider in January 2016 and comprises 71 million words.

Document count – the most frequent web domains and domain size distribution:

Top level domains	Documents
net	295,358
org	75,860
com	7,397
info	4,577
so	1,930

Web domains	Documents
risaala.net	22,823
goolfm.net	22,544
vidinfo.org	21,904
batalaalenews.net	17,079
keydmedia.net	15,453
alshahid.net	13,923
daadmadheedhnews.net	13,693
somaliland.org	13,203
vidoser.org	12,189
somalilandpost.net	10,853
radiodanan.net	10,196
geeska.net	8,378
camuudnews.net	8,218
nogob.net	7,045
allsomali24.org	6,755
sagalradio.org	6,154
qarninews.net	6,097

Second level domain size distribution	Domains
At least 1000 documents	73
At least 500 documents	96
At least 100 documents	150
At least 50 documents	181
At least 10 documents	352
At least 5 documents	487
At least 1 document	1,083
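
The "At least N documents" rows are cumulative counts: each row gives the number of second level domains that contribute N or more documents. The following is a minimal sketch of how such figures could be derived from per-domain document counts. It is not the procedure used to build soWaC; the function names and the thresholds in the usage note are illustrative assumptions, and the input is only a small subset of the web domains listed above.

```python
from collections import Counter

# Document counts for a few web domains, copied from the table above.
# This is a subset, so the results will not match the full-corpus figures.
DOMAIN_DOCS = {
    "risaala.net": 22823,
    "goolfm.net": 22544,
    "vidinfo.org": 21904,
    "batalaalenews.net": 17079,
    "keydmedia.net": 15453,
    "somaliland.org": 13203,
    "geeska.net": 8378,
    "qarninews.net": 6097,
}


def size_distribution(domain_docs, thresholds):
    """Map each threshold to the number of domains with at least that many documents."""
    return {
        t: sum(1 for count in domain_docs.values() if count >= t)
        for t in thresholds
    }


def tld_totals(domain_docs):
    """Sum document counts per top level domain (the text after the last dot)."""
    totals = Counter()
    for domain, count in domain_docs.items():
        totals[domain.rsplit(".", 1)[-1]] += count
    return dict(totals)
```

For example, `size_distribution(DOMAIN_DOCS, [5000, 10000, 20000])` counts how many of the eight sample domains reach each (hypothetical) threshold, and `tld_totals(DOMAIN_DOCS)` groups their document counts under `net` and `org`.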

News, politics, and religious sites account for a significant share of the corpus sources.

The corpus was created within the HaBiT project (Harvesting big text data for under-resourced languages). See https://habit-project.eu/wiki/SomaliCorpus for details.

Part-of-speech tagset

The soWaC corpus contains POS annotation based on Universal Dependencies, a cross-linguistically consistent annotation scheme designed to support multilingual parser development.

Tools to work with the Somali corpus

A complete set of Sketch Engine tools is available to work with this Somali corpus from the web to generate:

word sketch – Somali collocations categorized by grammatical relations
thesaurus – synonyms and similar words for every word
word lists – lists of Somali nouns, verbs, adjectives etc. organized by frequency
n-grams – frequency list of multi-word units
concordance – examples in context
keywords – terminology extraction of one-word units
text type analysis – statistics of metadata in the corpus
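
To illustrate what an n-gram frequency list contains, here is a minimal sketch that counts contiguous token sequences in a tokenized text. It is not how Sketch Engine computes n-grams; the function name `ngram_frequencies` and the placeholder tokens are assumptions made for illustration only.

```python
from collections import Counter


def ngram_frequencies(tokens, n):
    """Count every contiguous sequence of n tokens in the token list."""
    # Zipping n shifted copies of the list yields tuples of n adjacent tokens.
    shifted = (tokens[i:] for i in range(n))
    return Counter(zip(*shifted))


# Placeholder tokens standing in for a tokenized corpus sample.
tokens = "a b a b c a b".split()
bigrams = ngram_frequencies(tokens, 2)
```

Calling `bigrams.most_common()` then lists the two-token units in descending order of frequency, which is the shape of an n-gram frequency list.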

Bibliography

Corpus factory method

Kilgarriff, A., Reddy, S., Pomikálek, J., & Avinesh, P. V. S. (2010, May). A corpus factory for many languages. In LREC.
