Danish Gigaword Corpus | Sketch Engine

DAGW: Danish Gigaword Corpus

The Danish Gigaword Corpus (DAGW) is a 964-million-word Danish corpus made up of texts collected from the Internet. The texts come from a variety of sources, such as the European Parliament, OPUS and Wikipedia. The corpus was created by Leon Derczynski and Manuel R. Ciosici and is freely distributed with attribution. The Sketch Engine version is approximately 80 million words smaller than the original Danish Gigaword Corpus because the General Discussions and Parliament Elections sections were not included.

For further information, visit the homepage of the Danish Gigaword Project.

Part-of-speech tagset

Sketch Engine tagged the Danish Gigaword corpus with TreeTagger, using a Danish model that follows the ePos tagset and was trained on the ePAROLE corpus.

Copyright

Texts in the corpus are provided under Creative Commons Attribution 4.0 International (CC BY 4.0).

Sample attributions

In a press release:

The model is pre-trained using the Danish Gigaword Corpus (https://gigaword.dk), developed at the IT University of Copenhagen.

In academic writing:

Derczynski, L., Ciosici, M. R., et al. (2021). The Danish Gigaword Corpus. In Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa 2021).

Danish Gigaword corpus in detail

Basic statistics of the corpus

	Frequency*
Tokens	1,197,941,586
Words	964,617,784
Sentences	56,979,231
Documents	511,160
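
The figures above can be combined into a few derived ratios, such as average sentence length. The following is a minimal sketch that only reuses the four counts from the table; the function name `corpus_ratios` is hypothetical, and the reading of "tokens" as including punctuation while "words" exclude it is an assumption about how the counts are defined, not something stated on this page.

```python
# Counts copied from the basic statistics table.
TOKENS = 1_197_941_586
WORDS = 964_617_784
SENTENCES = 56_979_231
DOCUMENTS = 511_160


def corpus_ratios(tokens, words, sentences, documents):
    """Derive simple averages from the raw corpus counts (hypothetical helper)."""
    return {
        # Share of tokens that are counted as words.
        "word_share": words / tokens,
        # Average sentence length in tokens.
        "tokens_per_sentence": tokens / sentences,
        # Average document length in tokens.
        "tokens_per_document": tokens / documents,
    }


ratios = corpus_ratios(TOKENS, WORDS, SENTENCES, DOCUMENTS)
for name, value in ratios.items():
    print(f"{name}: {value:.2f}")
```

These are corpus-wide averages only; individual sections are likely to differ considerably in sentence and document length.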

Further information about texts in the corpus

A list of subcorpora

Subcorpus name	Sources	Size (in tokens)	% of the whole corpus
Conversation	Movie subtitles, Debates, Conversation, Speeches	329,037,536	27.5
Legal	Laws, Tax code, Court cases	333,236,660	27.8
News	News	44,472,637	3.7
Other	Other, Sønderjysk	1,409,439	0.1
Social Media	Forum	257,051,120	21.5
Web	Web	118,757,859	9.9
Wiki & Books	Encyclopaedic, Literature, Manuals, JVJ’s works, Religious	113,976,335	9.5
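
The percentage column can be reproduced from the token counts. The sketch below is illustrative only: it copies the sizes from the table and the total token count from the basic statistics, and the helper name `subcorpus_shares` is hypothetical. It also checks that the subcorpus sizes add up to the total, which suggests that the subcorpora do not overlap.

```python
# Token counts copied from the subcorpus table.
SUBCORPUS_TOKENS = {
    "Conversation": 329_037_536,
    "Legal": 333_236_660,
    "News": 44_472_637,
    "Other": 1_409_439,
    "Social Media": 257_051_120,
    "Web": 118_757_859,
    "Wiki & Books": 113_976_335,
}

# Total token count from the basic statistics table.
TOTAL_TOKENS = 1_197_941_586


def subcorpus_shares(counts, total):
    """Return each subcorpus's share of the corpus in percent, rounded to one decimal."""
    return {name: round(100 * size / total, 1) for name, size in counts.items()}


shares = subcorpus_shares(SUBCORPUS_TOKENS, TOTAL_TOKENS)
for name, share in shares.items():
    print(f"{name:<14}{share:>6.1f} %")

# The subcorpus sizes sum exactly to the total number of tokens.
assert sum(SUBCORPUS_TOKENS.values()) == TOTAL_TOKENS
```

Because the percentages are rounded to one decimal place, the rounded values need not sum to exactly 100.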

A list of text types

Dialect – the Danish dialect of the text

Section – a single source of text

Publication date – the publication date of the source document

Year of publication – the year (CE) in which the source document was published

Document ID – corresponds to the original filename

Form – the form of the text: written or spoken

Detailed information on text types available in the Danish Gigaword corpus can be found at http://www.derczynski.com/papers/dagw.pdf

Tools to work with the Danish Gigaword Corpus

A complete set of tools is available for working with this Danish corpus. They can be used to generate:

word sketch – Danish collocations categorized by grammatical relations
thesaurus – synonyms and similar words for every word
keywords – terminology extraction of one-word and multi-word units
word lists – lists of Danish nouns, verbs, adjectives, etc. organized by frequency
n-grams – frequency lists of multi-word units
concordance – examples in context
trends – diachronic analysis that automatically identifies neologisms and changes in use
text type analysis – statistics of metadata in the corpus

Bibliography

Derczynski, L., Ciosici, M. R., et al. (2021). The Danish Gigaword Corpus. In Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa 2021).

website: https://gigaword.dk/