soWaC: Somali web corpus
The Somali web corpus (soWaC) is a Somali corpus made up of texts collected from the Internet. The corpus was prepared according to the standards described in A Corpus Factory for Many Languages (Kilgarriff et al., LREC 2010).
The data was crawled by the SpiderLing web spider in January 2016 and comprises 71 million words.
Document count – the most frequent web domains and domain size distribution:
|Top level domain||Documents||Web domain||Documents||Second level domain size||Domains|
|net||295,358||risaala.net||22,823||At least 1000 documents||73|
|org||75,860||goolfm.net||22,544||At least 500 documents||96|
|com||7,397||vidinfo.org||21,904||At least 100 documents||150|
|info||4,577||batalaalenews.net||17,079||At least 50 documents||181|
|so||1,930||keydmedia.net||15,453||At least 10 documents||352|
| || ||alshahid.net||13,923||At least 5 documents||487|
| || ||daadmadheedhnews.net||13,693||At least 1 document||1,083|
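The size distribution in the table is cumulative: each row counts the domains that contribute at least the stated number of documents. A minimal Python sketch of how such counts could be derived from a list of document URLs follows. The function names (`domain_of`, `size_distribution`, `tld_counts`) are hypothetical, and treating the hostname minus a leading `www.` as the domain is a simplifying assumption, not the method used to build soWaC.

```python
from collections import Counter
from urllib.parse import urlparse

# Thresholds mirroring the rows of the table ("At least N documents").
THRESHOLDS = (1000, 500, 100, 50, 10, 5, 1)


def domain_of(url: str) -> str:
    """Return the hostname without a leading 'www.' (simplified domain)."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def docs_per_domain(urls) -> Counter:
    """Count documents per domain, skipping URLs without a hostname."""
    return Counter(d for d in map(domain_of, urls) if d)


def size_distribution(urls, thresholds=THRESHOLDS) -> dict:
    """For each threshold t, count the domains with at least t documents."""
    counts = docs_per_domain(urls)
    return {t: sum(1 for n in counts.values() if n >= t) for t in thresholds}


def tld_counts(urls) -> Counter:
    """Count documents per top level domain (last label of the domain)."""
    return Counter(domain_of(u).rsplit(".", 1)[-1] for u in urls if domain_of(u))
```

Because the counts are cumulative, the value for a lower threshold is never smaller than the value for a higher one, which matches the pattern in the table (73, 96, 150, and so on up to 1,083).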
News, politics, and religious sites account for a significant share of the corpus sources.
The corpus was created within the HaBiT project (Harvesting big text data for under-resourced languages). See more at https://habit-project.eu/wiki/SomaliCorpus
The soWaC corpus contains part-of-speech (POS) annotation based on Universal Dependencies, a framework for consistent grammatical annotation across languages that supports multilingual parser development.
Tools to work with the Somali corpus
A complete set of Sketch Engine tools is available to work with this Somali web corpus.
Corpus factory method
Kilgarriff, A., Reddy, S., Pomikálek, J., & Avinesh, P. V. S. (2010, May). A corpus factory for many languages. In Proceedings of LREC 2010.
Use Sketch Engine in minutes
Generate collocations, frequency lists, examples in context, and n-grams, or extract terms. Use our Quick Start Guide to learn it in minutes.
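To illustrate what a frequency list and an n-gram list contain, here is a minimal Python sketch. It is not Sketch Engine's implementation: the function names are hypothetical, the regular-expression tokenizer is a simplifying assumption (it splits Somali words at apostrophes, for example), and the sample text is a placeholder rather than corpus data.

```python
import re
from collections import Counter


def tokenize(text: str) -> list:
    """Lowercase the text and split it into word tokens (simplified)."""
    return re.findall(r"\w+", text.lower())


def frequency_list(tokens) -> list:
    """Return (token, count) pairs, most frequent first."""
    return Counter(tokens).most_common()


def ngrams(tokens, n: int = 2) -> Counter:
    """Count every run of n consecutive tokens."""
    return Counter(zip(*(tokens[i:] for i in range(n))))


# Placeholder text, not taken from the corpus.
tokens = tokenize("a b a b c")
print(frequency_list(tokens))
print(ngrams(tokens, 2).most_common(1))
```

A collocation measure would build on these counts by comparing how often two words occur together with how often each occurs on its own.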