The Estonian National Corpus 2019 (Estonian NC 2019)
The Estonian National Corpus is a language corpus made up of texts collected from various domains. The latest version of the corpus consists of the Estonian Reference Corpus (texts from the 1990s until 2008, compiled by Tartu University), Estonian Web (2013, 2017, 2019), Estonian Wikipedia (2017 and 2019), and Estonian DOAJ (2020). It contains 1.5 billion words, and the most recent data were crawled at the beginning of 2020.
There are two part-of-speech (POS) tag attributes:
- the abbreviated tag contains only basic information about part of speech (see the overview below),
- the longtag contains detailed information, including other categories for particular parts of speech.
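The relationship between the two attributes can be pictured with a small Python sketch. It assumes a tab-separated export with the columns word, tag, and longtag. That column layout, the `Token` class, the `parse_line` function, and the sample tag values are all hypothetical and serve only as illustration; they are not the corpus's actual format or tagset.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    word: str
    tag: str      # abbreviated POS tag (basic part of speech only)
    longtag: str  # detailed tag with additional categories


def parse_line(line: str) -> Token:
    """Split one tab-separated line into a Token (assumed column order)."""
    word, tag, longtag = line.rstrip("\n").split("\t")
    return Token(word, tag, longtag)


# Placeholder values, not real tags from the corpus tagset.
sample = [
    "word1\tTAG1\tTAG1.detail_a",
    "word2\tTAG1\tTAG1.detail_b",
    "word3\tTAG2\tTAG2.detail_a",
]

tokens = [parse_line(line) for line in sample]

# Counting by the abbreviated tag groups tokens by coarse part of speech;
# counting by longtag keeps the finer distinctions apart.
coarse_counts = Counter(t.tag for t in tokens)
fine_counts = Counter(t.longtag for t in tokens)
```

Grouping on the abbreviated tag is the natural choice for coarse frequency counts, while the longtag is the one to use when the additional categories matter.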
The Estonian National Corpus was morphologically annotated using the tagging tool Filosoft.
Abbreviated part-of-speech tags:
|| Adjective (positive)
|| Adjective (comparative)
|| Genitive attribute, i.e., indeclinable adjective
|| Proper noun
|| Adposition (pre- or postposition)
|| Numeral (cardinal)
|| Numeral (ordinal)
|| Common noun
|| Adjective (superlative)
|| Verb particle
|| Abbreviation or acronym