SoNaR: Dutch reference corpus
The SoNaR corpus (Stevin Nederlandstalig Referentie corpus) is a Dutch reference corpus consisting of 500 million tokens. The corpus is balanced for research on contemporary (1954–2011) written Dutch. It is also balanced by the number of speakers in the Dutch-speaking regions: one-third of the texts come from Flanders and two-thirds from the Netherlands. The corpus texts include newspapers, reports, etc., as well as chat, SMS, internet forums and email.
More information about the SoNaR corpus can be found at https://www.lt3.ugent.be/projects/sonar/
This corpus was part-of-speech (POS) tagged with the TreeTagger tool, using the following Dutch tagset legend.