etSkELL: Estonian corpus for SkELL

The Estonian corpus for Learners 2018 (etSkELL) is an Estonian corpus made up of sentences collected from the Estonian NC 2017 corpus and the Estonian coursebook corpus 2018. It was built specifically to provide the best example sentences, selected with the GDEX system (see below). The corpus was created in cooperation with the Institute of the Estonian Language, using a GDEX configuration for Estonian developed by Kristina Koppel.

SkELL

SkELL stands for Sketch Engine for Language Learning. It is a freely available web interface suited to learning Estonian.

Part-of-speech tagset

The etSkELL corpus was tagged with a PoS tagger developed at the University of Tartu, using the estNLTK tools and the Filosoft part-of-speech tagset.

Good sentence examples (GDEX)

The Estonian corpus for Learners 2018 (etSkELL) consists of separate sentences sorted by quality. The quality is computed by the GDEX system, which assigns a score to each sentence. The score is based mainly on sentence length (a minimum and a maximum length) and on the frequency of the words that occur in the sentence. The sentences are sorted so that those with the highest scores appear first in the concordance results.
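
The scoring and sorting described above can be sketched roughly as follows. This is a minimal illustration, not the actual GDEX implementation: the thresholds `MIN_LEN`, `MAX_LEN` and `RARE_BELOW`, the function names, and the way the two components are combined are all assumptions made for the example.

```python
# Hypothetical parameters, chosen only for illustration.
MIN_LEN = 4      # minimum sentence length in tokens
MAX_LEN = 20     # maximum sentence length in tokens
RARE_BELOW = 5   # a word seen fewer times than this counts as rare


def score_sentence(tokens, word_freq):
    """Return a score in [0, 1]; higher means a better example."""
    if not tokens:
        return 0.0
    # Length component: full credit inside the allowed interval, none outside.
    length_ok = 1.0 if MIN_LEN <= len(tokens) <= MAX_LEN else 0.0
    # Frequency component: share of tokens that are not rare in the corpus.
    common = sum(1 for t in tokens if word_freq.get(t.lower(), 0) >= RARE_BELOW)
    return length_ok * (common / len(tokens))


def rank_sentences(sentences, word_freq):
    """Sort tokenized sentences so the highest-scoring ones come first."""
    return sorted(sentences, key=lambda s: score_sentence(s, word_freq), reverse=True)
```

With a frequency table built from the corpus, `rank_sentences` returns the candidate sentences in the order in which a concordance would list them under this simplified scheme.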

Compared with other languages, the Estonian GDEX configuration is specific in that shorter sentences and sentences within a wider optimal interval are preferred. Instead of penalizing long words, the Estonian configuration does not allow words longer than 20 characters. It also penalizes sentences containing more than one adverb, one pronoun, one proper name, one numeral, one conjunction, one comma, or two verbs. One classifier selects only sentences that contain a verb, and another penalizes certain non-finite constructions.
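
As a rough illustration of these rules, the sketch below applies the hard filters (word length, presence of a verb) and the per-category limits to a tagged sentence. The limits come from the description above; everything else is an assumption: the tag names are placeholders rather than the actual tagset labels, the `PENALTY` value and the per-token deduction are hypothetical, and the classifier for non-finite constructions is left out.

```python
from collections import Counter

MAX_WORD_LEN = 20  # from the text: longer words are not allowed

# Highest count of each category that goes unpenalized (from the text).
# The tag names are placeholders, not the actual tagset labels.
LIMITS = {"ADV": 1, "PRON": 1, "PROPN": 1, "NUM": 1,
          "CONJ": 1, "COMMA": 1, "VERB": 2}

PENALTY = 0.25  # hypothetical deduction per token over a limit


def estonian_adjustment(tagged):
    """tagged: list of (word, tag) pairs.

    Returns None if the sentence is rejected outright, otherwise a
    multiplier in [0, 1] to apply to the sentence score.
    """
    # Hard filter: no word may exceed the maximum length.
    if any(len(word) > MAX_WORD_LEN for word, _ in tagged):
        return None
    counts = Counter(tag for _, tag in tagged)
    # Hard filter: the sentence must contain at least one verb.
    if counts["VERB"] == 0:
        return None
    # Soft penalty: count tokens beyond each category's limit.
    excess = sum(max(0, counts[tag] - limit) for tag, limit in LIMITS.items())
    return max(0.0, 1.0 - PENALTY * excess)
```

Separating rejection (`None`) from a reduced multiplier mirrors the distinction in the text between what the configuration does not allow and what it merely penalizes.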

Tools to work with the Estonian corpus

A complete set of tools is available to work with this Estonian corpus to generate:

  • word sketch – Estonian collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • word lists – lists of Estonian nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
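
To make the n-gram entry concrete, here is a minimal sketch of how a frequency list of multi-word units can be computed from tokenized sentences. It only illustrates the concept and is not how Sketch Engine computes n-grams; the function name `ngram_frequencies` is made up for this example.

```python
from collections import Counter


def ngram_frequencies(sentences, n=2):
    """Count n-grams within each tokenized sentence, most frequent first."""
    counts = Counter()
    for tokens in sentences:
        # Slide a window of n tokens across the sentence.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts.most_common()
```

N-grams are counted per sentence, so no unit spans a sentence boundary, which suits a corpus made up of separate sentences.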

Changelog

etSkELL 0.4

  • added dynamic structure attributes

etSkELL 0.3

  • initial version with 294 million tokens

References

Estonian GDEX configuration

Koppel, K. 2017. Heade näitelausete automaattuvastamine eesti keele õppesõnastike jaoks. Eesti Rakenduslingvistika Ühingu aastaraamat, 13, 53−71. DOI: 10.5128/ERYa13.04.

Kosem, I., Koppel, K., Kuhn, T. Z., Michelfeit, J., Tiberius, C. 2018. “Identification and automatic extraction of good dictionary examples: the case(s) of GDEX.” International Journal of Lexicography, ecy014, https://doi.org/10.1093/ijl/ecy01

SkELL corpus

BAISA, Vít and Vít SUCHOMEL. SkELL – Web Interface for English Language Learning. In Eighth Workshop on Recent Advances in Slavonic Natural Language Processing. Brno: Tribun EU, 2014, pp. 63-70. ISSN 2336-4289.

References to SkELL and versioning

From time to time, the underlying corpus data may change (cleaning, refining etc.), and the web interface may also change occasionally. When referring to particular results (using bookmarked URLs, for example), also cite the version they were obtained with. Each SkELL page shows its version via the “Terms” link in the bottom-left corner, e.g. VERSION1-VERSION2. The first part is the version of the interface and the second the version of the corpus data.

Search the Estonian corpus

Sketch Engine offers a range of tools to work with this Estonian corpus.
