etSKELL: Estonian corpus for SKELL

The Estonian corpus for Learners 2020 (etSKELL) is an Estonian corpus made up of sentences collected from the Estonian NC 2019 corpus, the Estonian Collocation Dictionary, and the Estonian Trends corpus. It was built specifically to provide the best example sentences, selected with the GDEX system (see below). The corpus was created in cooperation with the Institute of the Estonian Language, using a GDEX configuration for Estonian developed by Kristina Koppel.

SKELL

SKELL stands for Sketch Engine for Language Learning. It is a freely available web interface suited to learners of Estonian.

Part-of-speech tagset

The etSKELL corpus was tagged by a PoS tagger developed at the University of Tartu, using the EstNLTK tools and the Filosoft part-of-speech tagset.

Good sentence examples (GDEX)

The Estonian corpus for Learners 2020 (etSKELL) consists of separate sentences sorted by quality. The quality is computed by the GDEX system, which assigns a score to each sentence. The score is based mainly on sentence length (a minimum and a maximum length) and on the frequency of the words that occur in the sentence. Sentences with the highest scores are displayed first in concordance results.

What sets the Estonian GDEX configuration apart from those for other languages is that it prefers shorter sentences and sentences with wider optimal intervals. Instead of penalizing long words, it disallows words longer than 20 characters outright. It also penalizes sentences containing more than one adverb, one pronoun, one proper name, one numeral, one conjunction, one comma, or two verbs. One classifier selects only sentences that contain a verb, and another penalizes certain non-finite constructions.
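The rules described in this section can be illustrated with a minimal Python sketch. It is not the actual Estonian GDEX configuration: the length interval, the penalty size, the tag names, and every function name are hypothetical, the verb limit is read as "more than two verbs", and the word-frequency component and the non-finite classifier are left out. Only the 20-character limit, the verb requirement, and the per-category limits come from the description above.

```python
from dataclasses import dataclass

MAX_WORD_LEN = 20  # from the text: words longer than 20 characters are not allowed

# Hypothetical values: the text gives neither the length interval nor the penalty size.
MIN_TOKENS, MAX_TOKENS = 4, 12
PENALTY = 0.1

# Per-sentence limits from the text. Tag names are illustrative, not the Filosoft
# tagset, and "two verbs" is interpreted here as "more than two verbs".
LIMITS = {"ADV": 1, "PRON": 1, "PROPN": 1, "NUM": 1, "CONJ": 1, "COMMA": 1, "VERB": 2}


@dataclass(frozen=True)
class Token:
    form: str
    tag: str


def parse(tagged: str) -> list[Token]:
    """Turn 'form/TAG form/TAG ...' into a list of tokens."""
    return [Token(*item.rsplit("/", 1)) for item in tagged.split()]


def gdex_like_score(sentence: list[Token]) -> float:
    """Score a sentence between 0.0 and 1.0 using the simplified rules."""
    # Hard filters: a sentence needs a verb and must not contain an overlong word.
    if not any(t.tag == "VERB" for t in sentence):
        return 0.0
    if any(len(t.form) > MAX_WORD_LEN for t in sentence):
        return 0.0

    score = 1.0
    # Soft penalty for falling outside the (hypothetical) length interval.
    if not MIN_TOKENS <= len(sentence) <= MAX_TOKENS:
        score -= PENALTY
    # Soft penalty for each category that exceeds its limit.
    for tag, limit in LIMITS.items():
        if sum(1 for t in sentence if t.tag == tag) > limit:
            score -= PENALTY
    return max(score, 0.0)


def rank(sentences: list[list[Token]]) -> list[list[Token]]:
    """Best-scoring sentences first, as in the concordance ordering."""
    return sorted(sentences, key=gdex_like_score, reverse=True)


if __name__ == "__main__":
    candidates = [
        parse("Koer/NOUN jookseb/VERB väga/ADV kiiresti/ADV pargis/NOUN"),
        parse("Koer/NOUN jookseb/VERB pargis/NOUN kiiresti/ADV"),
    ]
    for sent in rank(candidates):
        print(gdex_like_score(sent), " ".join(t.form for t in sent))
```

Treating the verb requirement and the word-length limit as hard filters, and everything else as a subtractive penalty, mirrors the distinction the description draws between what is disallowed and what is merely penalized.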

Tools to work with the Estonian corpus

A complete set of tools is available for working with this Estonian corpus. They generate:

  • word sketch – Estonian collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • keywords – terminology extraction of one-word and multi-word units
  • word lists – lists of Estonian nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context

etSKELL 0.4

  • added dynamic structure attributes

etSKELL 0.3

  • initial version with 294 million tokens

Estonian GDEX configuration

Koppel, K. 2017. Heade näitelausete automaattuvastamine eesti keele õppesõnastike jaoks. Eesti Rakenduslingvistika Ühingu aastaraamat, 13, 53−71. https://doi.org/10.5128/ERYa13.04

Kosem, I., Koppel, K., Kuhn, T. Z., Michelfeit, J., Tiberius, C. 2018. Identification and automatic extraction of good dictionary examples: the case(s) of GDEX. International Journal of Lexicography, ecy014, https://doi.org/10.1093/ijl/ecy014

SKELL corpus

BAISA, Vít and Vít SUCHOMEL. SKELL – Web Interface for English Language Learning. In Eighth Workshop on Recent Advances in Slavonic Natural Language Processing. Brno: Tribun EU, 2014, pp. 63-70. ISSN 2336-4289.

References to SKELL and versioning

The underlying corpus data may change from time to time (cleaning, refining, etc.), and the web interface may also change occasionally. When referring to particular results (for example, with bookmarked URLs), also cite the version. Each SKELL page shows its version in the “Terms” link in the bottom-left corner, e.g. VERSION1-VERSION2, where the first part is the version of the interface and the second is the version of the corpus data.
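When storing bookmarked results, it can help to record the two parts of the version label separately. The following is a minimal Python sketch, assuming the label is always two parts joined by a single hyphen and that the interface version itself contains no hyphen; the names `SkellVersion` and `parse_version` are hypothetical and not part of SKELL.

```python
from typing import NamedTuple


class SkellVersion(NamedTuple):
    interface: str  # version of the web interface
    corpus: str     # version of the corpus data


def parse_version(label: str) -> SkellVersion:
    """Split a label of the form VERSION1-VERSION2 at the first hyphen."""
    interface, sep, corpus = label.strip().partition("-")
    if not sep or not interface or not corpus:
        raise ValueError(f"expected INTERFACE-CORPUS, got {label!r}")
    return SkellVersion(interface, corpus)


if __name__ == "__main__":
    # Placeholder label taken from the text, not a real version string.
    version = parse_version("VERSION1-VERSION2")
    print(f"interface {version.interface}, corpus data {version.corpus}")
```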

