ANW: Algemeen Nederlands Woordenboek
The Algemeen Nederlands Woordenboek (ANW) corpus is a balanced Dutch corpus of just over 100 million words, made up of texts from various domains. It was compiled at the Institute for Dutch Lexicology (INL) and completed in 2004.
The ANW corpus comprises:
- present-day literary texts (20%)
- texts containing neologisms (5%)
- texts of various domains in the Netherlands and Flanders (32%)
- newspaper texts (40%)
The remainder (about 3%) is the ‘Pluscorpus’, which consists of texts downloaded from the internet that contain words present in an INL word list but absent from a first version of the corpus.
To support searches by lemma and part of speech, the corpus has been annotated with lemmas and POS tags using technology originally developed for the Dutch PAROLE corpus (Does, Van der Voort van der Kleij 2002): a combination of statistical taggers, including TnT and three taggers developed at the INL. Lemmatisation was a deterministic procedure based on an extensive lexicon developed within the INL.
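The annotation step described above can be illustrated with a small sketch. It is not the INL pipeline: the source does not say how the taggers' outputs were combined, so majority voting is assumed here purely for illustration. The three toy taggers, the tag names (ART, N, V) and the three-entry lexicon are all hypothetical and do not reflect the ANW tagset or the INL lexicon.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

# A tagger maps a list of tokens to a list of POS tags of the same length.
Tagger = Callable[[List[str]], List[str]]


def combine_by_majority(tokens: List[str], taggers: List[Tagger]) -> List[str]:
    """Pick, per token, the tag proposed most often.

    Assumption: simple majority voting; ties go to the tagger listed first.
    """
    proposals = [tagger(tokens) for tagger in taggers]
    combined = []
    for i in range(len(tokens)):
        votes = [tags[i] for tags in proposals]
        counts = Counter(votes)
        best = max(counts.values())
        # First tag (in tagger order) that reaches the highest vote count.
        combined.append(next(tag for tag in votes if counts[tag] == best))
    return combined


def lemmatise(token: str, pos: str, lexicon: Dict[Tuple[str, str], str]) -> str:
    """Deterministic lookup of (lowercased form, POS) in a lexicon.

    Falls back to the lowercased form when the pair is not listed.
    """
    form = token.lower()
    return lexicon.get((form, pos), form)


def annotate(
    tokens: List[str],
    taggers: List[Tagger],
    lexicon: Dict[Tuple[str, str], str],
) -> List[Tuple[str, str, str]]:
    """Return (token, lemma, POS) triples for a tokenised sentence."""
    tags = combine_by_majority(tokens, taggers)
    return [(tok, lemmatise(tok, pos, lexicon), pos) for tok, pos in zip(tokens, tags)]


# Hypothetical toy data: three taggers that disagree on some tokens.
TOKENS = ["De", "katten", "slapen"]
TAGGERS: List[Tagger] = [
    lambda toks: ["ART", "N", "V"],
    lambda toks: ["ART", "N", "N"],
    lambda toks: ["ART", "V", "V"],
]
LEXICON = {
    ("de", "ART"): "de",
    ("katten", "N"): "kat",
    ("slapen", "V"): "slapen",
}

for token, lemma, pos in annotate(TOKENS, TAGGERS, LEXICON):
    print(token, lemma, pos)
```

Storing a lemma and a POS tag next to each token is what makes searches by lemma and part of speech possible: a query for the lemma "kat" tagged as a noun would then also match the plural form "katten".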
More information about the corpus is available here (in Dutch).
The ANW corpus was tagged using the following POS tagset.