The Reference Corpus of Estonian

The Estonian Reference Corpus is a language corpus made up of texts collected from various domains. The latest version of the corpus contains written texts, web texts and Estonian Wikipedia (up to autumn 2017). It comprises 1.1 billion words.

There are two types of POS tag attribute:

  • abbreviated tag contains only basic information about part of speech (see the overview below)
  • longtag contains detailed information including other categories for particular parts of speech

Part-of-speech tagset

The Estonian Reference Corpus is morphologically annotated using the Filosoft tagging tool.

Abbreviated part-of-speech tags:

A  Adjective (positive)
C  Adjective (comparative)
D  Adverb
G  Genitive attribute, i.e. indeclinable adjective
H  Proper noun
I  Interjection
J  Conjunction
K  Adposition (pre- or postposition)
N  Numeral (cardinal)
O  Numeral (ordinal)
P  Pronoun
S  Common noun
U  Adjective (superlative)
V  Verb
X  Verb particle
Y  Abbreviation or acronym
Z  Punctuation
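
The abbreviated tags listed above can be held in a simple lookup table when processing tagged text exported from the corpus. The following is a minimal Python sketch, not part of any Sketch Engine API: the names `POS_TAGS`, `describe_tag` and `tag_distribution` are hypothetical, and the sample sentence is hand-tagged for illustration rather than taken from the corpus.

```python
from collections import Counter

# Abbreviated part-of-speech tags, copied from the overview above.
POS_TAGS = {
    "A": "Adjective (positive)",
    "C": "Adjective (comparative)",
    "D": "Adverb",
    "G": "Genitive attribute, i.e. indeclinable adjective",
    "H": "Proper noun",
    "I": "Interjection",
    "J": "Conjunction",
    "K": "Adposition (pre- or postposition)",
    "N": "Numeral (cardinal)",
    "O": "Numeral (ordinal)",
    "P": "Pronoun",
    "S": "Common noun",
    "U": "Adjective (superlative)",
    "V": "Verb",
    "X": "Verb particle",
    "Y": "Abbreviation or acronym",
    "Z": "Punctuation",
}


def describe_tag(tag):
    """Return the description of an abbreviated tag, or a fallback string."""
    return POS_TAGS.get(tag, "Unknown tag")


def tag_distribution(tagged_tokens):
    """Return the relative frequency of each tag in (word, tag) pairs."""
    counts = Counter(tag for _, tag in tagged_tokens)
    total = sum(counts.values())
    return {tag: count / total for tag, count in counts.items()}


# Hypothetical, hand-tagged example ("Tallinn is a beautiful city.").
sample = [("Tallinn", "H"), ("on", "V"), ("ilus", "A"), ("linn", "S"), (".", "Z")]

for word, tag in sample:
    print(f"{word}\t{tag}\t{describe_tag(tag)}")
print(tag_distribution(sample))
```

A plain dictionary is enough here because the abbreviated tagset is small and fixed. The longtag attribute carries more categories and would need a richer structure.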

Overview of Estonian Reference corpora

The Estonian Reference corpus has the following versions:

  • Estonian Reference Corpus 2017 – 1.1 billion words (consisting of previous corpora, Estonian Web corpus 2013, newly crawled data and Estonian Wikipedia)
  • Estonian Reference Corpus with Web (Estonian NC 2013) – 463 million words (written texts + web texts of Estonian Web 2013)
  • Estonian Reference corpus (EstonianRC) – 203 million words (written texts)

Estonian Reference corpus in detail

The chart shows the distribution of the parts of speech in the Estonian Reference corpus 2017.

Distribution of parts of speech

Basic information

Tokens 1 371
Words 1 126
Sentences 125
Documents 3

* the figures above are in millions (rounded)
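
A few rough ratios can be derived from the basic figures above. The sketch below is illustrative arithmetic only. It uses the rounded values in millions, so the results are approximate, and the names `figures_millions` and `derived_ratios` are hypothetical. It also assumes that the difference between tokens and words consists of non-word tokens such as punctuation.

```python
# Basic figures from the table above, in millions (rounded).
figures_millions = {
    "tokens": 1371,
    "words": 1126,
    "sentences": 125,
    "documents": 3,
}


def derived_ratios(figures):
    """Compute approximate ratios from rounded corpus size figures."""
    return {
        # Tokens that are not counted as words (assumed: punctuation etc.).
        "non_word_tokens_millions": figures["tokens"] - figures["words"],
        # Share of all tokens that are words.
        "word_share_of_tokens": figures["words"] / figures["tokens"],
        # Average sentence length measured in tokens.
        "tokens_per_sentence": figures["tokens"] / figures["sentences"],
    }


for name, value in derived_ratios(figures_millions).items():
    print(f"{name}: {value:.3f}")
```

The document count is left out of the ratios because a figure rounded to 3 million is too coarse for a meaningful per-document average.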

Distribution of particular corpora

Tools to work with Estonian Reference corpus

A complete set of Sketch Engine tools is available to work with this Estonian corpus to generate:

  • word sketch – Estonian collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • word lists – lists of Estonian nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
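
To illustrate what an n-gram frequency list contains, the following minimal Python sketch counts n-grams in a short token list. It is a generic illustration and not how Sketch Engine computes its lists. The function name `ngram_counts` and the sample sentence are hypothetical.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens in a list of tokens."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # zip over n shifted copies of the list yields all consecutive n-tuples.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams)


# Hypothetical sample text ("this is good and this is new").
tokens = "see on hea ja see on uus".split()

bigrams = ngram_counts(tokens, 2)
for gram, freq in bigrams.most_common():
    print(" ".join(gram), freq)
```

Sorting by frequency with `most_common()` gives the same kind of ranked list as the n-grams tool, here on a toy scale.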

Estonian Reference corpus 2017

  • 1.1 billion words – newly crawled web data, Estonian Wikipedia + all previous versions
  • improved word sketch grammar

Estonian Reference Corpus with Web (Estonian NC 2013)

  • 463 million words – Estonian Web corpus + written texts

Estonian Reference corpus (EstonianRC)

  • 203 million words (written texts)

For more information about the corpus, including a summary of the longtag attribute, see the Estonian Reference Corpus document.

