DAGW: Danish Gigaword Corpus

The Danish Gigaword Corpus (DAGW) is a 964-million-word Danish corpus made up of texts collected from the Internet. Its sources include European Parliament texts, OPUS, Wikipedia, and others. The corpus was created by Leon Derczynski and Manuel R. Ciosici and is freely distributed with attribution. The Sketch Engine version is smaller than the original Danish Gigaword Corpus (by approx. 80 million words) because the General Discussions and Parliament Elections sections were not included.

For further information, visit the homepage of the Danish Gigaword Project (https://gigaword.dk).

Part-of-speech tagset

Sketch Engine tagged the Danish Gigaword corpus using TreeTagger with a Danish model that follows the ePos tagset and was trained on the ePAROLE corpus.

Copyright

Texts in the corpus are provided under Creative Commons Attribution 4.0 International (CC BY 4.0).

Sample attributions

The model is pre-trained using the Danish Gigaword Corpus (https://gigaword.dk), developed at the IT University of Copenhagen.

Derczynski, L., Ciosici, M. R., et al. (2021). The Danish Gigaword Corpus. In Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa 2021).

Danish Gigaword corpus in detail

Basic statistics of the corpus

               Frequency*
Tokens      1,197,941,586
Words         964,617,784
Sentences      56,979,231
Documents         511,160
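
A few derived figures follow directly from the counts above. The sketch below is illustrative arithmetic only: the function name and dictionary keys are hypothetical, and it assumes that the gap between tokens and words consists of non-word tokens such as punctuation.

```python
# Counts copied from the statistics table above.
TOKENS = 1_197_941_586
WORDS = 964_617_784
SENTENCES = 56_979_231
DOCUMENTS = 511_160


def derived_stats(tokens, words, sentences, documents):
    """Return simple ratios computed from the four corpus counts."""
    return {
        # Assumption: tokens that are not words are punctuation and similar.
        "non_word_tokens": tokens - words,
        "word_share": words / tokens,
        "tokens_per_sentence": tokens / sentences,
        "tokens_per_document": tokens / documents,
    }


stats = derived_stats(TOKENS, WORDS, SENTENCES, DOCUMENTS)
for name, value in stats.items():
    print(f"{name}: {value:,.2f}")
```

Averages like these are only a rough guide, since document and sentence lengths are likely to differ considerably between sources such as laws and forum posts.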

A list of subcorpora

Subcorpus name  Sources                                                     Size (in tokens)  % of the whole corpus
Conversation    Movie subtitles, Debates, Conversation, Speeches                 329,037,536                   27.5
Legal           Laws, Tax code, Court cases                                      333,236,660                   27.8
News            News                                                              44,472,637                    3.7
Other           Other, Sønderjysk                                                  1,409,439                    0.1
Social Media    Forum                                                            257,051,120                   21.5
Web             Web                                                              118,757,859                    9.9
Wiki & Books    Encyclopaedic, Literature, Manuals, JVJ’s works, Religious       113,976,335                    9.5
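
The percentage column can be reproduced from the token counts. The following is a minimal sketch of that check; the dictionary and function names are hypothetical, and the sizes are copied from the list above.

```python
# Subcorpus sizes in tokens, copied from the list of subcorpora above.
SUBCORPORA = {
    "Conversation": 329_037_536,
    "Legal": 333_236_660,
    "News": 44_472_637,
    "Other": 1_409_439,
    "Social Media": 257_051_120,
    "Web": 118_757_859,
    "Wiki & Books": 113_976_335,
}


def shares(sizes):
    """Return each subcorpus size as a percentage of the total, to one decimal."""
    total = sum(sizes.values())
    return {name: round(100 * size / total, 1) for name, size in sizes.items()}


total_tokens = sum(SUBCORPORA.values())
percentages = shares(SUBCORPORA)

print(f"Total tokens: {total_tokens:,}")
for name, pct in percentages.items():
    print(f"{name}: {pct}%")
```

The subcorpus sizes add up to the token count given in the basic statistics, so the subcorpora cover the whole corpus without overlap.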

A list of text types

Dialect – Danish dialect

Section – corresponds to a single source of text

Publication date – the publication date of the source document

Year of publication – the year CE that the source document was published

Document ID – corresponds to the original filename

Form – the form of the text: written or spoken

Detailed information on text types available in the Danish Gigaword corpus can be found at http://www.derczynski.com/papers/dagw.pdf
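
To make the text types above more concrete, the sketch below models them as a small record and filters documents by form and year. It does not reflect how Sketch Engine stores metadata: the class, field, and function names are hypothetical, and the three sample records are invented placeholders rather than real corpus documents.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class TextTypes:
    """Hypothetical record mirroring the text types listed above."""
    doc_id: str                              # Document ID (original filename)
    section: str                             # Section (a single source of text)
    form: str                                # Form: "written" or "spoken"
    publication_date: Optional[date] = None  # Publication date, if known
    dialect: Optional[str] = None            # Dialect, if any

    @property
    def year_of_publication(self) -> Optional[int]:
        # Year of publication is derived from the publication date here.
        return self.publication_date.year if self.publication_date else None


def select(docs, *, form=None, year=None):
    """Keep documents that match every criterion that was supplied."""
    return [
        d for d in docs
        if (form is None or d.form == form)
        and (year is None or d.year_of_publication == year)
    ]


# Invented placeholder records, for illustration only.
docs = [
    TextTypes("doc-001", "section-a", "written", date(2015, 3, 1)),
    TextTypes("doc-002", "section-b", "spoken", date(2015, 6, 12)),
    TextTypes("doc-003", "section-a", "written"),
]

written_2015 = select(docs, form="written", year=2015)
print([d.doc_id for d in written_2015])
```

Restricting a search to a subset of documents in this way is the same idea as narrowing a query by text type.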

Tools to work with the Danish Gigaword Corpus

A complete set of tools is available for working with this Danish corpus. The tools generate:

  • word sketch – Danish collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • keywords – terminology extraction of one-word and multi-word units
  • word lists – lists of Danish nouns, verbs, adjectives, etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
  • trends – diachronic analysis that automatically identifies neologisms and changes in use
  • text type analysis – statistics of metadata in the corpus
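
Two of the outputs listed above, n-grams and concordance, can be illustrated in a few lines. The sketch below is a simplified stand-in and not Sketch Engine's implementation: the function names are hypothetical, tokenisation is a plain whitespace split, and the Danish sample sentence is made up for the example.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous sequence of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def concordance(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for each exact match."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


# Made-up sample text, split on whitespace (an assumption for this sketch).
sample = "det er en god dag og det er en god idé".split()

bigrams = ngram_counts(sample, 2)
kwic = concordance(sample, "god", width=2)

print(bigrams.most_common(3))
for left, keyword, right in kwic:
    print(f"{left:>10} [{keyword}] {right}")
```

A real corpus tool works on tokenised and tagged text and uses indexes to scale to over a billion tokens, so this only conveys the shape of the output.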

