amWaC: Amharic corpus from the web
The Amharic web corpus (amWaC) is an Amharic corpus made up of texts collected from the Internet. The corpus was prepared according to the standards described in the paper "A Corpus Factory for Many Languages" (Kilgarriff et al., LREC 2010).
Data was crawled by the SpiderLing web spider in several rounds (August 2013, October 2015, January 2016, and 2017), reaching a final size of almost 26 million words. Texts are in the Ge’ez script with a matching transliteration in SERA (System for Ethiopic Representation in ASCII).
Transliteration of selected Ge’ez characters into the SERA system (Latin script).
Document count (the most frequent web domains and the domain size distribution):
Top-level domains
Domain size distribution: number of web domains with at least 1000, 500, 100, 50, 10, and 1 document
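A domain size distribution of this kind can be computed from the list of document URLs in a crawl. The following is a minimal sketch, not the procedure actually used for amWaC: the function names, the thresholds tuple, and the example URLs are assumptions made for illustration, and the host name is taken as the "domain" without any further normalisation.

```python
from collections import Counter
from urllib.parse import urlsplit

# Thresholds mirror the "at least N documents" rows of the table.
THRESHOLDS = (1000, 500, 100, 50, 10, 1)


def domain_of(url):
    """Return the lowercased host name of a URL ('' if it has none)."""
    return (urlsplit(url).hostname or "").lower()


def domain_size_distribution(urls, thresholds=THRESHOLDS):
    """Map each threshold to the number of domains with at least that many documents."""
    docs_per_domain = Counter(d for d in map(domain_of, urls) if d)
    return {
        t: sum(1 for n in docs_per_domain.values() if n >= t)
        for t in thresholds
    }


def top_level_domain_counts(urls):
    """Count documents per top-level domain (the last label of the host name)."""
    return Counter(
        d.rsplit(".", 1)[-1] for d in map(domain_of, urls) if d
    )


# Hypothetical example URLs, not taken from the corpus.
example_urls = (
    ["http://news.example.et/article"] * 12
    + ["http://blog.example.com/post"] * 3
)
distribution = domain_size_distribution(example_urls)
tld_counts = top_level_domain_counts(example_urls)
```

Because every domain with at least 1000 documents also has at least 500, the counts are cumulative and grow as the threshold decreases.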
News, political, and religious sites make up a significant share of the corpus sources.
The corpus was created within the framework of the HaBiT project (Harvesting big text data for under-resourced languages). For more information, see https://habit-project.eu/wiki/AmharicCorpus
The AmharicWaC corpus was tagged with TreeTagger, based on the manual annotation of 1,065 Amharic news items containing 210,000 prosodic words. See the Amharic part-of-speech tag legend.
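Once a corpus is tagged, a common first check is the frequency of each part-of-speech tag. The sketch below assumes a tab-separated, one-token-per-line export with the word in the first column and the tag in the second, and with structural markup on lines starting with `<`. This layout is an assumption, not a documented property of the corpus, and the tags `N` and `V` in the sample are placeholders rather than entries from the actual Amharic tag legend.

```python
from collections import Counter


def tag_frequencies(lines, tag_column=1):
    """Count tags in tab-separated token lines, skipping markup and blank lines."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        # Skip empty lines and structural markup such as <doc> or <s>.
        # Note: a literal token beginning with "<" would also be skipped.
        if not line or line.startswith("<"):
            continue
        fields = line.split("\t")
        if len(fields) > tag_column:
            counts[fields[tag_column]] += 1
    return counts


# Hypothetical sample; the tags are placeholders, not the real tagset.
sample = ["<doc>", "<s>", "ሰላም\tN", "ነው\tV", "</s>", "</doc>"]
frequencies = tag_frequencies(sample)
```

For a real file, the same function can be passed an open file handle, since it only iterates over lines.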