Icelandic Gigaword Corpus 2017

The Icelandic Gigaword Corpus 2017 is an Icelandic corpus made up of texts collected from the Internet. It comprises official texts (e.g. parliamentary speeches and legal texts), news media texts, and texts from other sources.

Each text is accompanied by metadata (author, document title, publication date, etc.), which can be viewed using the text type analysis tool. The corpus is intended for linguistic research and for use in language technology projects.

The official documentation is available at: https://clarin.is/en/resources/gigaword/

Note: According to the official website, the corpus is divided into two parts, IGC1 and IGC2. Only the second part, IGC2, is freely available and accessible in Sketch Engine.

Part-of-speech tagset and lemmatization

The Icelandic Gigaword Corpus 2017 was part-of-speech tagged by the IceNLP toolkit using the IFD tagset.
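
IFD tags are compact positional strings in which each character encodes one grammatical feature. The sketch below decodes noun tags only and is meant as an illustration, not as the IceNLP implementation: the function name decode_noun_tag is hypothetical, and the character-to-feature mapping is an assumption based on the commonly described IFD layout (word class, gender, number, case, optional suffixed article). Check it against the official tagset documentation before relying on it.

```python
# Illustrative decoder for IFD-style noun tags.
# ASSUMPTION: positional layout is class, gender, number, case, [article].
NOUN_GENDER = {"k": "masculine", "v": "feminine", "h": "neuter", "x": "unspecified"}
NOUN_NUMBER = {"e": "singular", "f": "plural"}
NOUN_CASE = {"n": "nominative", "o": "accusative", "þ": "dative", "e": "genitive"}


def decode_noun_tag(tag: str) -> dict:
    """Split a noun tag such as 'nkeng' into named features (hypothetical helper)."""
    if len(tag) < 4 or tag[0] != "n":
        raise ValueError(f"not a noun tag: {tag!r}")
    return {
        "word_class": "noun",
        "gender": NOUN_GENDER.get(tag[1], "unknown"),
        "number": NOUN_NUMBER.get(tag[2], "unknown"),
        "case": NOUN_CASE.get(tag[3], "unknown"),
        # A fifth character 'g' is assumed to mark the suffixed definite article.
        "definite": len(tag) > 4 and tag[4] == "g",
    }


if __name__ == "__main__":
    print(decode_noun_tag("nkeng"))
```

Under these assumptions, a tag like nkeng would read as a masculine singular nominative noun with the definite article. Other word classes use different positions and would need their own tables.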

Icelandic Gigaword Corpus 2017 corpus sizes

Unit	Frequency
Tokens	600,301,903
Words	532,028,866
Sentences	27,252,906
Documents	1,550,779
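
The counts above can be combined into a few derived ratios that give a feel for the corpus. This is a minimal sketch using only the figures from the table; the variable names are illustrative, and reading the difference between tokens and words as non-word tokens (such as punctuation) is an assumption based on how Sketch Engine usually distinguishes the two.

```python
# Corpus sizes copied from the table above.
TOKENS = 600_301_903
WORDS = 532_028_866
SENTENCES = 27_252_906
DOCUMENTS = 1_550_779

# Derived ratios (names are illustrative).
tokens_per_sentence = TOKENS / SENTENCES
sentences_per_document = SENTENCES / DOCUMENTS
tokens_per_document = TOKENS / DOCUMENTS
word_share = WORDS / TOKENS

# ASSUMPTION: tokens that are not words are punctuation and similar items.
non_word_tokens = TOKENS - WORDS

if __name__ == "__main__":
    print(f"tokens per sentence:    {tokens_per_sentence:.2f}")
    print(f"sentences per document: {sentences_per_document:.2f}")
    print(f"tokens per document:    {tokens_per_document:.2f}")
    print(f"share of word tokens:   {word_share:.1%}")
    print(f"non-word tokens:        {non_word_tokens:,}")
```

Roughly, this works out to about 22 tokens per sentence and a little under 400 tokens per document, with words making up close to 89% of all tokens.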

Search the Icelandic Gigaword Corpus 2017

Sketch Engine offers a range of tools to work with this Icelandic corpus.

Tools to work with the Icelandic corpus

The following Sketch Engine tools are available for this Icelandic corpus:

word sketch – Icelandic collocations categorized by grammatical relations
thesaurus – synonyms and similar words for every word
keywords – extraction of single-word terms
word lists – lists of Icelandic nouns, verbs, adjectives etc. organized by frequency
n-grams – frequency list of multi-word units
concordance – examples in context
text type analysis – statistics of metadata in the corpus
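
To make the word list and n-gram tools in the list above more concrete, here is a minimal sketch of the underlying idea: counting single tokens and contiguous token sequences. It is not how Sketch Engine computes these lists; the function name ngram_counts and the toy Icelandic sentence are made up for illustration and do not come from the corpus.

```python
from collections import Counter


def ngram_counts(tokens: list[str], n: int) -> Counter:
    """Count contiguous n-token sequences (hypothetical helper)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Slide a window of width n across the token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Toy input, invented for this example (not taken from the corpus).
tokens = "ég er að lesa og ég er að skrifa".split()

word_list = Counter(tokens)        # frequency list of single tokens
bigrams = ngram_counts(tokens, 2)  # frequency list of two-token units

if __name__ == "__main__":
    print(word_list.most_common(3))
    print(bigrams.most_common(3))
```

On a real corpus the same counts would normally be computed over lemmas or filtered by part of speech, which is what the tagging and lemmatization make possible.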

Changelog

Icelandic Gigaword Corpus 2017 (icelandic_gigaword17)

version icelandic_gigaword17 (February 2024)

Bibliography

Steingrímsson, Steinþór, Sigrún Helgadóttir, Eiríkur Rögnvaldsson, Starkaður Barkarson and Jón Guðnason. 2018. Risamálheild: A Very Large Icelandic Text Corpus. Proceedings of LREC 2018, pp. 4361–4366. Miyazaki, Japan.
