tiWaC: Tigrinya web corpus
The Tigrinya web corpus (tiWaC) is a Tigrinya corpus made up of texts collected from the Internet. The corpus was prepared according to the standards described in the paper A Corpus Factory for Many Languages (Kilgarriff et al., LREC 2010).
The data was crawled by the SpiderLing web spider in January 2016 and comprises 2 million words.
Document count – the most frequent web domains and domain size distribution:
|Top level domains||Web domains||Second level domain size distribution|
|org||1,023||*.blogspot.com||349||At least 1000 documents||0|
|com||699||*.jw.org||307||At least 500 documents||0|
|net||55||tewahdo.org||174||At least 100 documents||4|
| || ||harnnet.org||116||At least 50 documents||8|
| || ||eritreantewahdo.org||97||At least 10 documents||28|
| || ||mekaleh-eritra.org||78||At least 5 documents||42|
| || ||mahberemariamisrael.com||76||At least 1 document||129|
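The "at least N documents" column is cumulative: each row counts every domain that holds N or more documents. The Python sketch below shows one way such a distribution could be derived from per-domain document counts. It is only an illustration: the function names are hypothetical and not part of SpiderLing or Sketch Engine, and the input is limited to the seven domains listed in the table, so only the thresholds of 100 and above can match the published figures.

```python
from collections import Counter
from urllib.parse import urlsplit

# Thresholds taken from the table above.
THRESHOLDS = (1000, 500, 100, 50, 10, 5, 1)


def documents_per_domain(urls):
    """Count documents per host name (hypothetical helper, one URL per document)."""
    return Counter(urlsplit(url).hostname for url in urls)


def size_distribution(doc_counts, thresholds=THRESHOLDS):
    """For each threshold, count the domains with at least that many documents."""
    return {
        t: sum(1 for n in doc_counts.values() if n >= t)
        for t in thresholds
    }


# Only the seven domains listed in the table; the full corpus has more.
listed_domains = {
    "*.blogspot.com": 349,
    "*.jw.org": 307,
    "tewahdo.org": 174,
    "harnnet.org": 116,
    "eritreantewahdo.org": 97,
    "mekaleh-eritra.org": 78,
    "mahberemariamisrael.com": 76,
}

distribution = size_distribution(listed_domains)
for threshold, domains in distribution.items():
    print(f"At least {threshold} documents: {domains}")
```

With the full list of crawled URLs, `documents_per_domain` could supply the counts instead of the hand-copied dictionary.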
News/politics and religious sites account for a significant share of the corpus sources.
This Tigrinya corpus was created in the framework of the HaBiT project (Harvesting big text data for under-resourced languages); see more at https://habit-project.eu/wiki/TigrinyaCorpus
The tiWaC corpus contains POS annotation based on Universal Dependencies, a project for multilingual parser development.
Tools to work with the Tigrinya web corpus
A complete set of Sketch Engine tools is available to work with this Tigrinya web corpus.
Corpus factory method
Kilgarriff, A., Reddy, S., Pomikálek, J., & Avinesh, P. V. S. (2010, May). A corpus factory for many languages. In LREC.
Use Sketch Engine in minutes
Generate collocations, frequency lists, examples in context, and n-grams, or extract terms. Use our Quick Start Guide to learn it in minutes.