soWaC: Somali corpus from the web
The Somali web corpus (soWaC) is a Somali corpus made up of texts collected from the Internet. The corpus was prepared according to the standards described in A Corpus Factory for Many Languages (Kilgarriff et al., LREC 2010).
The data was crawled by the SpiderLing web spider in January 2016 and comprises 71 million words.
Document count – the most frequent web domains and domain size distribution:
| Top level domain | Documents | Web domain | Documents | Second level domain size | Domains |
|---|---|---|---|---|---|
| net | 295,358 | risaala.net | 22,823 | At least 1000 documents | 73 |
| org | 75,860 | goolfm.net | 22,544 | At least 500 documents | 96 |
| com | 7,397 | vidinfo.org | 21,904 | At least 100 documents | 150 |
| info | 4,577 | batalaalenews.net | 17,079 | At least 50 documents | 181 |
| so | 1,930 | keydmedia.net | 15,453 | At least 10 documents | 352 |
| | | alshahid.net | 13,923 | At least 5 documents | 487 |
| | | daadmadheedhnews.net | 13,693 | At least 1 document | 1,083 |
| | | somaliland.org | 13,203 | | |
| | | vidoser.org | 12,189 | | |
| | | somalilandpost.net | 10,853 | | |
| | | radiodanan.net | 10,196 | | |
| | | geeska.net | 8,378 | | |
| | | camuudnews.net | 8,218 | | |
| | | nogob.net | 7,045 | | |
| | | allsomali24.org | 6,755 | | |
| | | sagalradio.org | 6,154 | | |
| | | qarninews.net | 6,097 | | |
News, politics and religious sites account for a significant share of the corpus sources.
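The "at least N documents" figures are cumulative counts: each row gives the number of web domains that contribute N or more documents. As a rough illustration only (this is not the procedure used to build soWaC), the sketch below computes such a distribution from a list of document URLs. The function name `domain_size_distribution`, the grouping by full host name, and the sample URLs are all assumptions made for the example.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_size_distribution(urls, thresholds=(1000, 500, 100, 50, 10, 5, 1)):
    """Count how many web domains contribute at least N documents."""
    # Assumption: one document per URL, grouped by full host name.
    docs_per_domain = Counter(urlparse(u).hostname for u in urls)
    return {
        t: sum(1 for n in docs_per_domain.values() if n >= t)
        for t in thresholds
    }


# Hypothetical toy input, not taken from the corpus.
sample_urls = (
    [f"http://example.net/a{i}" for i in range(12)]
    + [f"http://example.org/b{i}" for i in range(5)]
    + ["http://example.so/c0"]
)
distribution = domain_size_distribution(sample_urls, thresholds=(10, 5, 1))
print(distribution)  # {10: 1, 5: 2, 1: 3}
```

Because the thresholds are cumulative, the count can only grow as the threshold drops, which matches the shape of the figures reported for the corpus.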
The corpus was created within the HaBiT project (Harvesting big text data for under-resourced languages); for more information, see https://habit-project.eu/wiki/SomaliCorpus
Part-of-speech tagset
The soWaC corpus contains POS annotation based on Universal Dependencies, an annotation scheme designed to support multilingual parser development.
Tools to work with the Somali corpus
A complete set of Sketch Engine tools is available to work with this Somali corpus from the web to generate:
- word sketch – Somali collocations categorized by grammatical relations
- thesaurus – synonyms and similar words for every word
- word lists – lists of Somali nouns, verbs, adjectives etc. organized by frequency
- n-grams – frequency list of multi-word units
- concordance – examples in context
- keywords – terminology extraction of one-word units
- text type analysis – statistics of metadata in the corpus
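To make the idea of an n-gram frequency list concrete, here is a minimal sketch in plain Python. It does not reproduce how Sketch Engine computes n-grams; the function name `ngram_frequencies`, the whitespace tokenization, and the placeholder tokens are assumptions chosen purely for illustration.

```python
from collections import Counter


def ngram_frequencies(tokens, n=2):
    """Return a Counter mapping each n-gram (a tuple of tokens) to its frequency."""
    # Zip n shifted copies of the token list to form sliding windows.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams)


# Hypothetical placeholder tokens, not real corpus data.
tokens = "a b a b c".split()
bigrams = ngram_frequencies(tokens, n=2)
for gram, freq in bigrams.most_common():
    print(" ".join(gram), freq)
```

On a real corpus the token list would come from the corpus tokenizer rather than a whitespace split, and the resulting list is usually cut off at a minimum frequency.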
Bibliography
Corpus factory method
Kilgarriff, A., Reddy, S., Pomikálek, J., & Avinesh, P. V. S. (2010, May). A corpus factory for many languages. In LREC.