etSKELL: Estonian corpus for SKELL
The Estonian corpus for Learners 2018 (etSKELL) is an Estonian corpus made up of sentences collected from the Estonian NC 2017 corpus and the Estonian coursebook corpus 2018. It was built specifically to provide the best example sentences using the GDEX system (described below). The corpus was created in cooperation with the Institute of the Estonian Language, using a GDEX configuration for Estonian developed by Kristina Koppel.
SKELL is an abbreviation of Sketch Engine for Language Learning. It is a freely available web interface suitable for learning Estonian.
The etSKELL corpus was tagged with a PoS tagger developed at the University of Tartu, using the estNLTK tools and the Filosoft part-of-speech tagset.
Good sentence examples (GDEX)
The Estonian corpus for Learners 2018 (etSKELL) consists of separate sentences sorted according to their quality. This quality is computed by the GDEX system, which assigns a score to each sentence. The score is based mainly on the sentence length (minimum and maximum length) and on the frequency of the words that occur in the sentence. The sentences are sorted so that those with the highest scores are displayed as the first results of a concordance.
Compared to configurations for other languages, the Estonian GDEX configuration prefers shorter sentences and sentences with wider optimal intervals. Instead of penalizing long words, it does not allow words longer than 20 characters at all. It also penalizes sentences containing more than one adverb, pronoun, proper name, numeral, conjunction or comma, or more than two verbs. One classifier selects only sentences that contain a verb, and another penalizes certain non-finite constructions.
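The scoring idea described above can be illustrated with a minimal Python sketch. It is not the actual Estonian GDEX configuration: the function names, the tag labels, the optimal length interval and all penalty weights are assumptions made for illustration. Only the 20-character word limit, the verb requirement and the per-category limits come from the description above, and the classifier for non-finite constructions is not modelled.

```python
# Illustrative GDEX-like scorer. All weights and the length interval are
# assumed values; tag labels are hypothetical, not the Filosoft tagset.
MIN_LEN, MAX_LEN = 4, 12      # assumed optimal sentence length in words
MAX_WORD_CHARS = 20           # from the text: longer words are not allowed
# From the text: counts above these limits are penalized.
TAG_LIMITS = {"adverb": 1, "pronoun": 1, "proper": 1, "numeral": 1,
              "conjunction": 1, "comma": 1, "verb": 2}
LENGTH_PENALTY = 0.3          # assumed
RARE_WORD_WEIGHT = 0.5        # assumed
TAG_PENALTY = 0.1             # assumed


def gdex_score(tokens, common_words):
    """Score a sentence given as (word, tag) pairs; returns a value in [0, 1]."""
    words = [word for word, tag in tokens if tag != "comma"]
    tags = [tag for _, tag in tokens]
    # Hard filters: the sentence needs a verb and no over-long words.
    if "verb" not in tags or any(len(w) > MAX_WORD_CHARS for w in words):
        return 0.0
    score = 1.0
    # Penalize sentences outside the assumed optimal length interval.
    if not MIN_LEN <= len(words) <= MAX_LEN:
        score -= LENGTH_PENALTY
    # Penalize in proportion to the share of words outside the frequent-word set.
    rare = sum(1 for w in words if w.lower() not in common_words)
    score -= RARE_WORD_WEIGHT * rare / len(words)
    # Penalize each category that exceeds its limit.
    for tag, limit in TAG_LIMITS.items():
        if tags.count(tag) > limit:
            score -= TAG_PENALTY
    return max(score, 0.0)


def rank_sentences(sentences, common_words):
    """Return sentences ordered from highest to lowest score."""
    return sorted(sentences, key=lambda s: gdex_score(s, common_words), reverse=True)
```

In this sketch a sentence that fails a hard filter gets a score of 0 and therefore sorts last, which mirrors how the best-scoring sentences appear first in a concordance.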
Tools to work with the Estonian corpus
A complete set of tools is available to work with this Estonian corpus to generate:
- word sketch – Estonian collocations categorized by grammatical relations
- thesaurus – synonyms and similar words for every word
- keywords – terminology extraction of one-word and multi-word units
- word lists – lists of Estonian nouns, verbs, adjectives etc. organized by frequency
- n-grams – frequency list of multi-word units
- concordance – examples in context
Changelog
- added dynamic structure attributes
- initial version with 294 million tokens
Estonian GDEX configuration
Koppel, K. 2017. Heade näitelausete automaattuvastamine eesti keele õppesõnastike jaoks. Eesti Rakenduslingvistika Ühingu aastaraamat, 13, 53–71. https://doi.org/10.5128/ERYa13.04
Kosem, I., Koppel, K., Kuhn, T. Z., Michelfeit, J., Tiberius, C. 2018. Identification and automatic extraction of good dictionary examples: the case(s) of GDEX. International Journal of Lexicography, ecy014. https://doi.org/10.1093/ijl/ecy014
Baisa, V., Suchomel, V. 2014. SKELL – Web Interface for English Language Learning. In Eighth Workshop on Recent Advances in Slavonic Natural Language Processing. Brno: Tribun EU, 63–70. ISSN 2336-4289.
References to SKELL and versioning
From time to time, the underlying corpus data may change (cleaning, refining, etc.), and the web interface may also change occasionally. When referring to particular results (for example, via bookmarked URLs), also cite the version. Each SKELL page shows its version via the “Terms” link in the bottom left corner, in the form VERSION1-VERSION2, where the first part is the version of the interface and the second is the version of the corpus data.
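When recording results alongside a version, the VERSION1-VERSION2 label can be split into its two parts. The following is a small sketch, assuming that neither part contains a hyphen of its own; the exact format of real version labels is not specified here, and `SkellVersion` and `parse_skell_version` are hypothetical names.

```python
from typing import NamedTuple


class SkellVersion(NamedTuple):
    interface: str  # version of the web interface
    corpus: str     # version of the corpus data


def parse_skell_version(label: str) -> SkellVersion:
    """Split an 'INTERFACE-CORPUS' label at the first hyphen (assumed format)."""
    interface, sep, corpus = label.strip().partition("-")
    if not sep or not interface or not corpus:
        raise ValueError(f"not a VERSION1-VERSION2 label: {label!r}")
    return SkellVersion(interface, corpus)
```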
Search the Estonian corpus
Use a free web interface suitable for Estonian learners.