Icelandic Gigaword Corpus 2017

The Icelandic Gigaword Corpus 2017 is an Icelandic corpus made up of texts collected from the Internet. It consists of official texts (e.g. parliamentary speeches and legal texts), texts from news media, and texts from other sources.

Each text is accompanied by metadata (author, document title, publication date, etc.), which can be viewed using the text type analysis tool. The corpus is intended for linguistic research and for use in language technology projects.

The official documentation is available at: https://clarin.is/en/resources/gigaword/

Note: According to the official website, the corpus is divided into two parts, IGC1 and IGC2. Only the second part, IGC2, is freely available and accessible in Sketch Engine.

Part-of-speech tagset and lemmatization

The corpus was part-of-speech tagged with the IceNLP toolkit using the IFD tagset.

Icelandic Gigaword Corpus 2017 corpus sizes

Unit        Frequency
Tokens      600,301,903
Words       532,028,866
Sentences   27,252,906
Documents   1,550,779
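
The counts above can be turned into a few derived averages that give a rough feel for the corpus. The short Python sketch below is only an illustration: the variable names are made up for this example, and the ratios are simple divisions of the published figures, not statistics reported by Sketch Engine.

```python
# Published corpus sizes, copied from the table above.
tokens = 600_301_903
words = 532_028_866
sentences = 27_252_906
documents = 1_550_779

# Simple derived averages (illustrative only, not official figures).
tokens_per_sentence = tokens / sentences
tokens_per_document = tokens / documents
sentences_per_document = sentences / documents
word_share = words / tokens  # share of tokens that are counted as words

print(f"Tokens per sentence:    {tokens_per_sentence:.1f}")
print(f"Tokens per document:    {tokens_per_document:.0f}")
print(f"Sentences per document: {sentences_per_document:.1f}")
print(f"Words as share of tokens: {word_share:.1%}")
```

The gap between the token and word counts arises because tokens include punctuation and other non-word items, so the last ratio shows roughly how much of the corpus is made up of words.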

Search the Icelandic Gigaword Corpus 2017

Sketch Engine offers a range of tools to work with this Icelandic corpus.

Tools to work with the Icelandic corpus

A set of Sketch Engine tools is available to work with this Icelandic corpus to generate:

  • word sketch – Icelandic collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • keywords – terminology extraction of one-word units
  • word lists – lists of Icelandic nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
  • text type analysis – statistics of metadata in the corpus
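
To make two of the listed outputs more concrete, the sketch below shows what an n-gram frequency list and a keyword-in-context concordance compute, using plain Python on a toy token list. It is not Sketch Engine's implementation or API; the function names `ngram_counts` and `kwic` and the sample sentence are hypothetical and chosen only for illustration.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous sequence of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def kwic(tokens, keyword, width=2):
    """Return (left context, keyword, right context) for each occurrence."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


# Toy input; a real corpus would supply tokenized Icelandic text.
tokens = "the cat saw the cat and the dog".split()

# Bigram frequency list, most frequent first.
for ngram, freq in ngram_counts(tokens, 2).most_common(3):
    print(" ".join(ngram), freq)

# Concordance lines for one keyword with one token of context on each side.
for left, kw, right in kwic(tokens, "cat", width=1):
    print(f"{left} [{kw}] {right}")
```

On a corpus of this size, such counts are computed from prebuilt indexes rather than by scanning tokens in memory, so the sketch only conveys the idea behind the tools.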

Icelandic Gigaword Corpus 2017 (icelandic_gigaword17)

version icelandic_gigaword17 (February 2024)

Steingrímsson, Steinþór, Sigrún Helgadóttir, Eiríkur Rögnvaldsson, Starkaður Barkarson and Jón Guðnason. 2018. Risamálheild: A Very Large Icelandic Text Corpus. Proceedings of LREC 2018, pp. 4361-4366. Miyazaki, Japan.
