The Reference Corpus of Estonian
The Estonian Reference Corpus is a language corpus made up of texts collected from various domains. The latest version of the corpus contains written texts, web texts and the Estonian Wikipedia (up to autumn 2017). It comprises 1.1 billion words.
There are two types of POS tag attribute:
- the abbreviated tag contains only basic information about the part of speech (see the overview below)
- the longtag contains detailed information, including other categories for particular parts of speech
Part-of-speech tagset
The Estonian Reference Corpus was morphologically annotated with the tagging tool Filosoft.
Abbreviated part-of-speech tags:
A | Adjective (positive) |
C | Adjective (comparative) |
D | Adverb |
G | Genitive attribute, i.e. indeclinable adjective |
H | Proper noun |
I | Interjection |
J | Conjunction |
K | Adposition (pre- or postposition) |
N | Numeral (cardinal) |
O | Numeral (ordinal) |
P | Pronoun |
S | Common noun |
U | Adjective (superlative) |
V | Verb |
X | Verb particle |
Y | Abbreviation or acronym |
Z | Punctuation |
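When post-processing data exported from the corpus, it can be convenient to keep the abbreviated tags in a lookup table. The following is a minimal Python sketch of that idea. The names `POS_TAGS` and `describe_tag` are hypothetical helpers introduced here for illustration; they are not part of Sketch Engine or of the tagger.

```python
# Abbreviated part-of-speech tags, copied from the table above.
POS_TAGS = {
    "A": "Adjective (positive)",
    "C": "Adjective (comparative)",
    "D": "Adverb",
    "G": "Genitive attribute, i.e. indeclinable adjective",
    "H": "Proper noun",
    "I": "Interjection",
    "J": "Conjunction",
    "K": "Adposition (pre- or postposition)",
    "N": "Numeral (cardinal)",
    "O": "Numeral (ordinal)",
    "P": "Pronoun",
    "S": "Common noun",
    "U": "Adjective (superlative)",
    "V": "Verb",
    "X": "Verb particle",
    "Y": "Abbreviation or acronym",
    "Z": "Punctuation",
}


def describe_tag(tag: str) -> str:
    """Return the description of an abbreviated tag.

    Tags that are not in the table yield the string 'Unknown tag'.
    """
    return POS_TAGS.get(tag, "Unknown tag")


if __name__ == "__main__":
    for tag in ("S", "V", "K"):
        print(tag, "=", describe_tag(tag))
```

The sketch covers only the abbreviated tags; the longtag values are described in the corpus document mentioned in the bibliography and are not reproduced here.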
Overview of Estonian Reference corpora
The Estonian Reference corpus is available in the following versions:
- Estonian Reference Corpus 2017 – 1.1 billion words (consisting of the previous corpora, the Estonian Web corpus 2013, newly crawled data and the Estonian Wikipedia)
- Estonian Reference Corpus with Web (Estonian NC 2013) – 463 million words (written texts + web texts of Estonian Web 2013)
- Estonian Reference corpus (EstonianRC) – 203 million words (written texts)
Estonian Reference corpus in detail
The chart shows the distribution of the parts of speech in the Estonian Reference corpus 2017.
Distribution of parts of speech
Further information about texts in the corpus
Basic information
Item | Frequency* |
Tokens | 1 371 |
Words | 1 126 |
Sentences | 125 |
Documents | 3 |
* The figures above are rounded to the nearest million.
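The counts in the table above can be combined into a few rough derived figures, such as the average sentence length. The short Python sketch below shows the arithmetic. Because the inputs are rounded to millions, the results are approximate, and the reading of the token/word difference as non-word tokens (for example punctuation) is an assumption rather than something stated in the table.

```python
# Figures from the "Basic information" table, in millions (rounded).
TOKENS_M = 1371
WORDS_M = 1126
SENTENCES_M = 125

# Average sentence length, counted in tokens and in words.
tokens_per_sentence = TOKENS_M / SENTENCES_M
words_per_sentence = WORDS_M / SENTENCES_M

# Share of tokens that are counted as words.
word_share = WORDS_M / TOKENS_M

# Tokens that are not counted as words (assumed to be mostly punctuation).
non_word_tokens_m = TOKENS_M - WORDS_M

print(f"Tokens per sentence: {tokens_per_sentence:.1f}")
print(f"Words per sentence:  {words_per_sentence:.1f}")
print(f"Word share of tokens: {word_share:.1%}")
print(f"Non-word tokens: {non_word_tokens_m} million")
```

On these rounded figures, a sentence averages roughly 11 tokens or 9 words, and about 82% of tokens are counted as words.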
Distribution of particular corpora
Tools to work with the Estonian Reference corpus
A complete set of Sketch Engine tools is available to work with this Estonian corpus to generate:
- word sketch – Estonian collocations categorized by grammatical relations
- thesaurus – synonyms and similar words for every word
- word lists – lists of Estonian nouns, verbs, adjectives etc. organized by frequency
- n-grams – frequency list of multi-word units
- concordance – examples in context
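To make two of the outputs listed above more concrete, the following Python sketch counts n-grams and builds simple keyword-in-context (concordance) lines over a toy token list. It only illustrates the general ideas and is not how Sketch Engine computes them. The function names `ngrams` and `kwic`, and the sample sentence, are hypothetical and not taken from the corpus.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return an iterator over every run of n consecutive tokens, as tuples."""
    return zip(*(tokens[i:] for i in range(n)))


def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for each hit of keyword."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Toy pre-tokenized text (hypothetical sample, not taken from the corpus).
tokens = "see on hea raamat ja see on uus raamat".split()

# Frequency list of two-word units.
bigram_counts = Counter(ngrams(tokens, 2))
print(bigram_counts.most_common(3))

# Concordance lines for the word "raamat" with two tokens of context per side.
for left, kw, right in kwic(tokens, "raamat", width=2):
    print(f"{left:>12} [{kw}] {right}")
```

A real corpus query would additionally work on lemmas and part-of-speech tags rather than on raw word forms only.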
Changelog
Estonian Reference corpus 2017
- 1.1 billion words – newly crawled web data, Estonian Wikipedia + all previous versions
- improved word sketch grammar
Estonian Reference Corpus with Web (Estonian NC 2013)
- 463 million words – Estonian Web corpus + written texts
Estonian Reference corpus (EstonianRC)
- 203 million words (written texts)
Bibliography
For more information about the corpus, including a longtag summary, see the Estonian Reference corpus document.