Afrikaans corpus from Wikipedia
The Afrikaans Wikipedia Corpus (afwiki) is an Afrikaans corpus made up of texts collected from the Afrikaans edition of the internet encyclopedia Wikipedia in March 2018. The corpus contains 15 million words.
Tools to work with the Afrikaans corpus
A complete set of tools is available for working with the Afrikaans Wikipedia corpus. They can generate:
- word lists – lists of Afrikaans words ordered by frequency
- n-grams – frequency lists of multi-word units
- concordances – examples of a word or phrase in context
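To make the three outputs above concrete, here is a minimal Python sketch that computes a word list, an n-gram frequency list, and a simple keyword-in-context concordance over a tiny invented sample. It is not how Sketch Engine implements these tools. The function names (`tokenize`, `word_list`, `ngrams`, `concordance`), the sample sentence, and the naive letter-only tokenizer are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    # Naive tokenizer (an assumption): lowercase, keep runs of Unicode letters.
    # Real corpus tools handle cases such as the Afrikaans article "'n" properly.
    return re.findall(r"[^\W\d_]+", text.lower())


def word_list(tokens):
    # Words ordered by descending frequency.
    return Counter(tokens).most_common()


def ngrams(tokens, n):
    # Frequency list of n-token sequences (multi-word units).
    grams = (tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return Counter(grams).most_common()


def concordance(tokens, keyword, width=3):
    # Keyword-in-context lines: (left context, keyword, right context).
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Invented sample text, not taken from the corpus.
sample = "Die kat sit op die mat. Die hond sit op die stoep."
tokens = tokenize(sample)

print(word_list(tokens)[:3])
print(ngrams(tokens, 2)[:3])
for left, kw, right in concordance(tokens, "sit", width=2):
    print(f"{left:>10} [{kw}] {right}")
```

On a 15-million-word corpus the same ideas apply, but a production tool would work from a prebuilt index rather than scanning a token list in memory.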
Changelog
English Web 2015 (enTenTen15)
- initial size 28 billion words
v2 (spring 2017)
- 15 billion words
- genre classification
- in-depth analysis and removal of spam, including documents that are too short
English Web 2013 (enTenTen13)
- 19 billion words
English Web 2012 (enTenTen12)
version 1 (14 June 2012)
- sample of corpus – 3.7 billion words
- crawled by SpiderLing in May 2012
- encoded in UTF-8
version 2 (2012)
- full corpus – 11 billion words
English Web 2008 (enTenTen08)
version 1 (15 November 2010)
- initial version – 3.3 billion tokens
- crawled by Heritrix in 2008
- encoded in Latin1