Afrikaans corpus from Wikipedia

The Afrikaans Wikipedia Corpus (afwiki) is a corpus of texts collected from the Afrikaans edition of the internet encyclopedia Wikipedia in early October 2022. The corpus contains 22 million words.

Part-of-speech tagset

The Afrikaans corpus from Wikipedia has been tagged by the NCHLT tagger (derived from HunPos) using the following tagset.

Tools to work with the Afrikaans corpus

A complete set of tools is available for working with this Afrikaans Wikipedia corpus. The tools generate:

word sketch – Afrikaans collocations categorized by grammatical relations
thesaurus – synonyms and similar words for every word
keywords – terminology extraction of one-word and multi-word units
word lists – lists of Afrikaans nouns, verbs, adjectives etc. organized by frequency
n-grams – frequency list of multi-word units
concordance – examples in context
text type analysis – statistics of metadata in the corpus
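
To make a few of the outputs listed above more concrete, here is a minimal Python sketch of how a frequency word list, an n-gram list and a simple concordance could be computed from raw text. It is an illustration only and not how the corpus tools themselves are implemented: the sample sentence is invented rather than taken from the corpus, the tokenizer is deliberately naive, and the function names (tokenize, ngram_counts, concordance) are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (naive tokenizer)."""
    return re.findall(r"\w+", text.lower())


def ngram_counts(tokens, n):
    """Count every contiguous sequence of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def concordance(tokens, keyword, width=3):
    """Return keyword-in-context lines with up to `width` tokens per side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


# Invented sample text for illustration; not taken from the corpus.
sample = "Die kat sit op die mat. Die hond sit op die stoep."
tokens = tokenize(sample)

word_freq = Counter(tokens)          # word list organized by frequency
bigrams = ngram_counts(tokens, 2)    # frequency list of two-word units
kwic = concordance(tokens, "sit", width=2)  # examples in context

print(word_freq.most_common(3))
print(bigrams.most_common(2))
for line in kwic:
    print(line)
```

A real corpus tool would additionally rely on the part-of-speech tags and lemmas, for example to restrict a word list to nouns or to group collocations by grammatical relation, which this sketch does not attempt.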