Afrikaans corpus from Wikipedia

The Afrikaans Wikipedia Corpus (afwiki) is an Afrikaans corpus made up of texts collected from the Afrikaans edition of the internet encyclopedia Wikipedia in March 2018. The corpus consists of 15 million words.

Tools to work with the Afrikaans corpus

A complete set of tools is available for working with this Afrikaans Wikipedia corpus. They can generate:

  • word lists – lists of Afrikaans words organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
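To make the three outputs listed above concrete, here is a minimal Python sketch of how a word list, an n-gram frequency list and a keyword-in-context concordance can be computed from plain text. It is an illustration only and does not show how Sketch Engine implements these tools: the function names (`tokenize`, `word_list`, `ngrams`, `concordance`) are hypothetical, the tokenizer is a simple regular expression, and the sample sentence is invented, not taken from the corpus.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (naive regex tokenizer)."""
    return re.findall(r"\w+", text.lower())


def word_list(tokens):
    """Return (word, frequency) pairs, most frequent first."""
    return Counter(tokens).most_common()


def ngrams(tokens, n=2):
    """Return (n-gram tuple, frequency) pairs, most frequent first."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams).most_common()


def concordance(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for each hit of keyword."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Invented sample text, used only to exercise the functions.
sample = "Die kat sit op die mat. Die hond sit op die stoep."
tokens = tokenize(sample)

print(word_list(tokens)[:3])
print(ngrams(tokens, 2)[:3])
for left, kw, right in concordance(tokens, "sit", width=2):
    print(f"{left:>12} [{kw}] {right}")
```

On a 15-million-word corpus the same counting idea applies, but a real system would rely on a pre-built index and a language-aware tokenizer instead of scanning raw text on every query.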

Changelog

English Web 2015 (enTenTen15)

  • initial size 28 billion words

v2 (spring 2017)

  • 15 billion words
  • genre classification
  • in-depth analysis and removal of spam, including documents that are too short

English Web 2013 (enTenTen13)

  • 19 billion words

English Web 2012 (enTenTen12)

version 1 (14 June 2012)

  • sample of corpus – 3.7 billion words
  • crawled by SpiderLing in May 2012
  • encoded in UTF-8

version 2 (2012)

  • full corpus – 11 billion words

English Web 2008 (enTenTen08)

version 1 (15 November 2010)

  • initial version – 3.3 billion tokens
  • crawled by Heritrix in 2008
  • encoded in Latin1

Search the Afrikaans corpus

Sketch Engine offers a range of tools to work with this Afrikaans corpus from Wikipedia.

Your own Wikipedia corpora

We can build a Wikipedia corpus in any language for you. Please contact us.

Use Sketch Engine in minutes

Generate collocations, frequency lists, examples in context and n-grams, or extract terms. Use our Quick Start Guide to learn how in minutes.