amWaC – Amharic corpus from the web

amWaC: Amharic corpus from the web

The Amharic web corpus (amWaC) is an Amharic corpus made up of texts collected from the Internet. The corpus was prepared according to the standards described in A Corpus Factory for Many Languages (Kilgarriff et al., LREC 2010).

Data was crawled by the SpiderLing web spider in several rounds (August 2013, October 2015, January 2016, and 2017), with a final size of almost 26 million words. Texts are in the Ge’ez script with a matching SERA transliteration (SERA is the System for Ethiopic Representation in ASCII).

Transliteration of selected Ge’ez characters into SERA system (Latin script).
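
As a rough illustration of how a character-level transliteration of this kind works, the sketch below maps a handful of Ge’ez syllables to SERA-style Latin strings. The mapping is a tiny hand-picked subset and the names `SERA_SUBSET` and `to_sera` are hypothetical; this is not the table or the tool used to build the corpus.

```python
# Illustrative subset only: a few Ge'ez syllables with SERA-style Latin forms.
# The full SERA scheme covers the whole Ethiopic syllabary.
SERA_SUBSET = {
    "ለ": "le",
    "ሉ": "lu",
    "ላ": "la",
    "ሰ": "se",
    "ም": "m",
}


def to_sera(text: str) -> str:
    """Transliterate character by character; unmapped characters pass through."""
    return "".join(SERA_SUBSET.get(ch, ch) for ch in text)


print(to_sera("ሰላም"))
```

Under these assumed mappings the word ሰላም comes out as "selam". A lookup table is enough here because each Ge’ez character encodes a whole consonant-vowel syllable, so no context is needed to pick the Latin form.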

Document count – the most frequent web domains and domain size distribution:

Top-level domain	Documents	Web domain	Documents	Domain size	Number of domains
com	33,503	gov.et	21,224	At least 1000 documents	12
et	21,731	blogspot.com	10,328	At least 500 documents	6
org	18,631	jw.org	6,512	At least 100 documents	29
net	1,295	addisadmassnews.com	3,751	At least 50 documents	19
va	106	wikipedia.org	3,228	At least 10 documents	69
others	243	ethiopiazare.com	3,045	At least 1 document	202
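
To put the top-level domain counts in proportion, the following minimal sketch sums the document counts from the table above and prints each domain's share. The variable and function names are only illustrative, and the total is simply the sum of the listed rows.

```python
# Document counts per top-level domain, copied from the table above.
tld_documents = {
    "com": 33503,
    "et": 21731,
    "org": 18631,
    "net": 1295,
    "va": 106,
    "others": 243,
}

total = sum(tld_documents.values())


def share(count: int, total: int) -> float:
    """Return count as a percentage of total."""
    return 100.0 * count / total


for tld, count in tld_documents.items():
    print(f"{tld:<7}{count:>7,}  {share(count, total):5.1f}%")
print(f"{'total':<7}{total:>7,}")
```

The listed rows add up to 75,509 documents, of which the com domain accounts for roughly 44%.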

The content of news/political and religious sites has a significant presence in the corpus sources.

The corpus was created within the HaBiT project (Harvesting big text data for under-resourced languages). For more information, see https://habit-project.eu/wiki/AmharicCorpus

Part-of-speech tagset

The amWaC corpus was tagged with TreeTagger, based on a manual annotation of 1,065 Amharic news items containing 210,000 prosodic words. See the Amharic part-of-speech tag legend.

Tools to work with the Amharic corpus

A complete set of tools is available to work with this Amharic corpus to generate:

word sketch – Amharic collocations categorized by grammatical relations
thesaurus – synonyms and similar words for every word
keywords – terminology extraction of one-word units
word lists – lists of Amharic nouns, verbs, adjectives etc. organized by frequency
n-grams – frequency list of multi-word units
concordance – examples in context
text type analysis – statistics of metadata in the corpus
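
To make the word list and n-gram entries above concrete, here is a minimal sketch of how frequency lists of single words and multi-word units can be computed from a tokenized text. It is a generic illustration with placeholder tokens and hypothetical function names, not a description of how Sketch Engine implements these tools.

```python
from collections import Counter


def word_list(tokens):
    """Frequency list of single tokens, most frequent first."""
    return Counter(tokens).most_common()


def ngram_frequencies(tokens, n):
    """Count every run of n consecutive tokens."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


# Placeholder tokens standing in for a tokenized corpus.
tokens = ["a", "b", "a", "b", "c"]

print(word_list(tokens))
print(ngram_frequencies(tokens, 2).most_common())
```

On real data the tokens would come from the corpus itself, and the counts would typically be computed on lemmas or filtered by part-of-speech tag rather than on raw word forms.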

Changelog

version 1 (21st April 2017)

created word sketches
added attribute “sera”

initial version (5th April 2017)

size 17 million words

Bibliography

Amharic web corpus

Rychlý, P., & Suchomel, V. (2016, September). Annotated Amharic Corpora. In International Conference on Text, Speech, and Dialogue (pp. 295-302). Springer International Publishing.

Corpus factory method

Kilgarriff, A., Reddy, S., Pomikálek, J., & Avinesh, P. V. S. (2010, May). A corpus factory for many languages. In LREC.
