daTenTen: Corpus of the Danish Web
The Danish Web Corpus (daTenTen) is a Danish corpus made up of texts collected from the Internet. The corpus belongs to the TenTen corpus family, a set of web corpora built using the same method with a target size of 10+ billion words. Sketch Engine currently provides access to TenTen corpora in more than 30 languages.
The data for the latest version of the Danish Web corpus was crawled by the SpiderLing web spider between June and August 2020. The corpus comprises more than 3.4 billion words.
The corpus has the attributes common to all TenTen corpora.
The texts are tagged by TreeTagger using a Danish model that follows the ePos tagset and was trained on the ePAROLE corpus.
The older version, Danish Web 2014, was tagged by CST’s TaggerXML with the following tagset.