Afrikaans corpus from Wikipedia

The Afrikaans Wikipedia Corpus (afwiki) is an Afrikaans corpus made up of texts collected from the Afrikaans edition of the online encyclopedia Wikipedia in early October 2022. The corpus consists of 22 million words.

Part-of-speech tagset

The Afrikaans corpus from Wikipedia was tagged with the NCHLT tagger (derived from HunPos) using the following tagset.

Tools to work with the Afrikaans corpus

A complete set of tools is available for working with this Afrikaans Wikipedia corpus. The tools generate:

  • word sketch – Afrikaans collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • keywords – terminology extraction of one-word and multi-word units
  • word lists – lists of Afrikaans nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
  • text type analysis – statistics of metadata in the corpus
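
To make three of the outputs listed above more concrete (word lists, n-grams and concordance), here is a minimal Python sketch. It is not how Sketch Engine computes them. The function names, the simple regex tokenizer and the two-sentence Afrikaans sample are all assumptions made for illustration, and the sketch ignores part-of-speech tags and sentence boundaries.

```python
import re
from collections import Counter


def tokenize(text):
    # Simplified tokenizer (an assumption): lowercase, keep runs of letters only.
    return re.findall(r"[^\W\d_]+", text.lower())


def word_list(tokens):
    # Word list: (word, frequency) pairs, most frequent first.
    return Counter(tokens).most_common()


def ngrams(tokens, n):
    # Frequency of every run of n consecutive tokens.
    # Sentence boundaries are ignored, so n-grams may span two sentences.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def concordance(tokens, keyword, width=2):
    # Keyword in context: (left context, keyword, right context) per hit.
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


# Toy sample text, not taken from the corpus.
sample = "Die kat sit op die mat. Die hond sit op die stoel."
tokens = tokenize(sample)

if __name__ == "__main__":
    print(word_list(tokens)[:3])
    print(ngrams(tokens, 2).most_common(2))
    for left, kw, right in concordance(tokens, "sit"):
        print(f"{left:>10} [{kw}] {right}")
```

On a real corpus the same counts would normally be computed over lemmas and part-of-speech tags rather than raw lowercase tokens, which is what lets word lists be restricted to nouns, verbs or adjectives.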

Search the Afrikaans corpus

Sketch Engine offers a range of tools to work with this Afrikaans corpus from Wikipedia.

Your own Wikipedia corpora

We can build a Wikipedia corpus in any language for you. Please contact us.

Use Sketch Engine in minutes

Generate collocations, frequency lists, examples in context, n-grams or extract terms. Use our Quick Start Guide to learn it in minutes.