The Estonian National Corpus 2019 (Estonian NC 2019)

The Estonian National Corpus is a language corpus made up of texts collected from various domains. The latest version of the corpus consists of the Estonian Reference Corpus (texts from the 1990s to 2008, compiled by the University of Tartu), Estonian Web (2013, 2017, 2019), Estonian Wikipedia (2017 and 2019), and Estonian DOAJ (2020). It contains 1.5 billion words, and the most recent data were crawled at the beginning of 2020.

The corpus has two POS tag attributes:

  • the abbreviated tag contains only basic information about the part of speech (see the overview below),
  • the longtag contains detailed information, including further grammatical categories for particular parts of speech.

Part-of-speech tagset

The Estonian National Corpus is morphologically annotated with the tagging tool EstNLTK v1.6.

Abbreviated part-of-speech tags:

A  Adjective (positive)
C  Adjective (comparative)
D  Adverb
G  Genitive attribute, i.e., indeclinable adjective
H  Proper noun
I  Interjection
J  Conjunction
K  Adposition (pre- or postposition)
N  Numeral (cardinal)
O  Numeral (ordinal)
P  Pronoun
S  Common noun
U  Adjective (superlative)
V  Verb
X  Verb particle
Y  Abbreviation or acronym
Z  Punctuation
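
The tag overview above can be kept as a small lookup table when post-processing exported corpus data. The following is a minimal sketch in Python, not part of Sketch Engine or EstNLTK: the names POS_TAGS, describe_tag and cql_for_tag are hypothetical, and the attribute name "tag" used in the query string is an assumption about how the abbreviated tag is exposed in the corpus.

```python
# Abbreviated part-of-speech tags, copied from the overview above.
POS_TAGS = {
    "A": "Adjective (positive)",
    "C": "Adjective (comparative)",
    "D": "Adverb",
    "G": "Genitive attribute, i.e., indeclinable adjective",
    "H": "Proper noun",
    "I": "Interjection",
    "J": "Conjunction",
    "K": "Adposition (pre- or postposition)",
    "N": "Numeral (cardinal)",
    "O": "Numeral (ordinal)",
    "P": "Pronoun",
    "S": "Common noun",
    "U": "Adjective (superlative)",
    "V": "Verb",
    "X": "Verb particle",
    "Y": "Abbreviation or acronym",
    "Z": "Punctuation",
}


def describe_tag(tag: str) -> str:
    """Return the description of an abbreviated tag, or a fallback string."""
    return POS_TAGS.get(tag, "unknown tag")


def cql_for_tag(tag: str, attribute: str = "tag") -> str:
    """Build a single-token query of the form [attribute="value"].

    The default attribute name "tag" is an assumption; check the corpus
    information page for the actual attribute names.
    """
    if tag not in POS_TAGS:
        raise ValueError(f"not an abbreviated tag from the overview: {tag!r}")
    return f'[{attribute}="{tag}"]'


if __name__ == "__main__":
    for t in ("S", "V", "K"):
        print(t, "-", describe_tag(t))
    print(cql_for_tag("H"))
```

Rejecting tags that are not in the overview catches typos before a query is sent, which is the main reason for routing query construction through the lookup table.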

Overview of Estonian National Corpus versions

The Estonian National Corpus has the following versions:

  • Estonian National Corpus 2019 (Estonian NC 2019) – 1.5 billion words, comprising the Estonian Reference Corpus (90s–2008), Estonian Web (2013, 2017, 2019), Estonian Wikipedia (2017 and 2019) and Estonian DOAJ (2020). Cleaned and deduplicated, with text type annotation.
  • Estonian National Corpus 2017 (Estonian NC 2017) – 1.1 billion words, comprising the Estonian Reference Corpus (90s–2008), Estonian Web (2013 and 2017) and Estonian Wikipedia (2017).
  • Estonian National Corpus 2013 (Estonian NC 2013) – 463 million words, comprising the Estonian Reference Corpus (90s–2008) and Estonian Web (2013).
  • Estonian Reference Corpus 1990–2008 (EstonianRC) – 203 million words (written texts).

Estonian National Corpus in detail

The chart shows the distribution of the parts of speech in the Estonian National Corpus 2017.

Distribution of parts of speech

Basic information

Unit        Frequency (millions)*
Tokens      1,371
Words       1,126
Sentences   125
Documents   3

* The figures above are rounded to the nearest million.
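
A few rough averages can be derived from the figures above. The sketch below is illustrative arithmetic only: the variable names are hypothetical, the inputs are the rounded values from the table, and the results are therefore approximate. Per-document averages are left out because the document count, rounded to 3 million, is too coarse to divide by meaningfully.

```python
# Counts in millions, copied from the basic information table above.
COUNTS_MILLIONS = {
    "tokens": 1371,
    "words": 1126,
    "sentences": 125,
    "documents": 3,
}


def ratio(numerator: str, denominator: str) -> float:
    """Ratio of two counts; the 'millions' unit cancels out."""
    return COUNTS_MILLIONS[numerator] / COUNTS_MILLIONS[denominator]


# Average sentence length, measured in tokens and in words.
tokens_per_sentence = ratio("tokens", "sentences")
words_per_sentence = ratio("words", "sentences")

# Share of tokens that are not counted as words (punctuation and similar).
non_word_share = 1 - ratio("words", "tokens")

if __name__ == "__main__":
    print(f"tokens per sentence: {tokens_per_sentence:.1f}")
    print(f"words per sentence:  {words_per_sentence:.1f}")
    print(f"non-word tokens:     {non_word_share:.1%}")
```

With these inputs, the average sentence comes out at roughly 11 tokens, of which about 9 are words.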

[Chart: Distribution of particular corpora]

Tools to work with Estonian National Corpus

The complete set of Sketch Engine tools is available for working with this Estonian corpus:

  • word sketch – Estonian collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • keywords – terminology extraction of one-word and multi-word units
  • word lists – lists of Estonian nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context

Estonian National Corpus 2019 (Estonian NC 2019)

  • 1.5 billion words – Estonian Reference Corpus (90s–2008), Estonian Web (2013, 2017, 2019), Estonian Wikipedia (2017 and 2019) and Estonian DOAJ (2020)
  • text type annotation

Estonian National Corpus 2017 (Estonian NC 2017)

  • 1.1 billion words – newly crawled web data, Estonian Wikipedia + all previous versions
  • improved word sketch grammar

Estonian National Corpus 2013 (Estonian NC 2013)

  • 463 million words – Estonian Web corpus + written texts

Estonian Reference Corpus 1990–2008 (EstonianRC)

  • 203 million words (written texts)

For more information about the corpus, including a summary of the longtag values, see the Estonian Reference Corpus document.
