csSKELL: Czech Corpus for SKELL
The Czech Corpus for SKELL is a text database used by the Czech SKELL interface (csSKELL) available at https://skell.sketchengine.eu/#home?lang=cs. The corpus does not contain whole documents but only sentences sorted according to their text quality.
For corpus searches, this has two consequences:
- consecutive sentences are unrelated to each other
- concordance lines shown first should be of better quality than later ones, for example by containing fewer non-alphabetic characters and punctuation marks and more frequent words
The text quality score was computed by the GDEX system.
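The idea of ranking sentences by quality can be illustrated with a toy scorer. This is only a sketch and not the GDEX algorithm: the function name `toy_quality_score`, the two features, their equal weights, and the `min_count` threshold are all assumptions made for illustration.

```python
import re
from collections import Counter


def toy_quality_score(sentence: str, word_freq: Counter, min_count: int = 5) -> float:
    """Hypothetical score in [0, 1]; higher means a cleaner-looking sentence."""
    tokens = re.findall(r"\w+", sentence.lower())
    if not tokens:
        return 0.0
    # Share of characters that are letters or whitespace
    # (digits, punctuation and symbols lower this ratio).
    clean_chars = sum(ch.isalpha() or ch.isspace() for ch in sentence)
    char_ratio = clean_chars / len(sentence)
    # Share of tokens that count as frequent in the reference word list.
    frequent = sum(1 for tok in tokens if word_freq[tok] >= min_count)
    freq_ratio = frequent / len(tokens)
    # Equal weighting is an arbitrary choice for this sketch.
    return 0.5 * char_ratio + 0.5 * freq_ratio


# Tiny made-up frequency list and sentences, for demonstration only.
word_freq = Counter({"pes": 10, "běží": 5, "domů": 7})
sentences = ["#### pes >>> 42 !!!", "Pes běží domů."]

# Best-scoring sentences come first, as in a quality-sorted concordance.
ranked = sorted(sentences, key=lambda s: toy_quality_score(s, word_freq), reverse=True)
print(ranked)
```

Sorting once at corpus build time, rather than at query time, would explain why the first concordance lines tend to be the cleanest ones.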
The corpus draws on several sources. The first is websites archived by the Czech Webarchiv as part of its selective harvests. The second is 1,800 crawled websites provided by Webarchiv. The remaining sources are articles and talk pages from Czech Wikipedia (downloaded in April 2017) and texts from the .cz domain of the Czech Timestamped web corpus.
This variety of sources lets users explore Czech in its everyday usage in a collection of 1.4 billion words in more than 90 million sentences.
|Source||No. of words||Percentage|
|Webarchiv: selective harvests||~ 987,299,101||68.40 %|
|Webarchiv: other sources||~ 232,047,827||16.07 %|
|Timestamped web corpus||~ 133,488,941||9.24 %|
|Wikipedia including talk pages||~ 90,575,062||6.27 %|
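The percentage column can be recomputed from the word counts. The sketch below uses the approximate counts from the table; the names `word_counts` and `share_percent` are illustrative only.

```python
# Approximate word counts per source, copied from the table.
word_counts = {
    "Webarchiv: selective harvests": 987_299_101,
    "Webarchiv: other sources": 232_047_827,
    "Timestamped web corpus": 133_488_941,
    "Wikipedia including talk pages": 90_575_062,
}

total = sum(word_counts.values())


def share_percent(count: int, total: int) -> float:
    """Return the share of `count` in `total` as a percentage."""
    return 100 * count / total


print(f"total words: {total:,}")
for source, count in word_counts.items():
    print(f"{source}: {share_percent(count, total):.2f} %")
```

The total comes to 1,443,410,931 words, consistent with the stated 1.4 billion. Recomputed shares may differ from the table in the last decimal place, which suggests the published figures were truncated rather than rounded.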
The corpus is accessible to all users with a subscription plan and to site licence members, but not to trial users.
Tools to work with the Czech SKELL corpus
A complete set of Sketch Engine tools is available for this corpus. They generate:
- word sketch – Czech collocations categorized by grammatical relations
- thesaurus – synonyms and similar words for every word
- keywords – terminology extraction of one-word and multi-word units
- word lists – lists of Czech nouns, verbs, adjectives etc. organized by frequency
- n-grams – frequency list of multi-word units
- concordance – examples in context
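To make the n-gram item concrete, here is a minimal sketch of an n-gram frequency list. It is not Sketch Engine's implementation: the function `ngram_frequencies` and the whitespace tokenization are assumptions for illustration. Because the corpus stores isolated sentences, the sketch counts n-grams within each sentence and never across sentence boundaries.

```python
from collections import Counter


def ngram_frequencies(sentences, n=2):
    """Count n-grams (as tuples of lowercased tokens) within each sentence."""
    counts = Counter()
    for sentence in sentences:
        # Naive whitespace tokenization; a real corpus tool uses a proper tokenizer.
        tokens = sentence.lower().split()
        # Slide a window of length n over the tokens of this sentence only.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


# Made-up example sentences.
examples = ["Dobrý den všem", "dobrý den"]
bigrams = ngram_frequencies(examples, n=2)
print(bigrams.most_common(3))
```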
|Version||Description||Corpus size (words)|
|1.0||initial version without any cleaning||1,717,516,129|
|2.0||first published version; simple cleaning that removed Slovak texts, texts without diacritics, and headlines||1,608,867,697|
|2.1||further cleaning of Slovak texts and texts without diacritics; removed sentences containing: texts automatically created by Wikipedia; non-ASCII characters; only non-alphabetical characters; HTML tags; URLs and email addresses|
|2.2 (current version)||further cleaning of texts without diacritics; removed most sentences with GDEX value “0”; removed sentences starting with n-grams (e.g. "Václav MORAVEC ," or "Moderátor – Václav Moravec"); removed sentences not starting/ending with tag|
|2.3 (in preparation)||further cleaning of texts without diacritics; removed sentences containing a hapax legomenon (a word with only one occurrence in the whole corpus)|
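Some of the cleaning steps listed in the version history can be sketched as simple filters. This is an illustration under stated assumptions, not the pipeline actually used to build the corpus: the regular expressions, the function names `drop_noise` and `drop_hapax_sentences`, the tokenization, and the order of the steps are all hypothetical.

```python
import re
from collections import Counter

# Deliberately simple patterns; real-world cleaning needs more robust ones.
HTML_TAG = re.compile(r"<[^>]+>")
URL = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")


def tokenize(sentence):
    """Lowercased word tokens (Unicode-aware, so Czech diacritics are kept)."""
    return re.findall(r"\w+", sentence.lower())


def drop_noise(sentences):
    """Drop sentences with HTML tags, URLs, emails, or no letters at all."""
    kept = []
    for s in sentences:
        if HTML_TAG.search(s) or URL.search(s) or EMAIL.search(s):
            continue
        if not any(ch.isalpha() for ch in s):
            continue
        kept.append(s)
    return kept


def drop_hapax_sentences(sentences):
    """Drop sentences containing a word that occurs only once in the input.

    Single pass only: removing sentences can create new hapaxes,
    which this sketch does not chase.
    """
    freq = Counter(tok for s in sentences for tok in tokenize(s))
    return [s for s in sentences if all(freq[tok] > 1 for tok in tokenize(s))]


# Made-up input sentences.
raw = [
    "Pes běží domů.",
    "Pes běží domů rychle.",
    "<b>Pes</b> běží.",
    "Napište na info@example.cz",
    "Viz https://example.cz/pes",
    "12345 !!!",
]
cleaned = drop_hapax_sentences(drop_noise(raw))
print(cleaned)
```

Word frequencies are computed here after the noise filter; whether the real hapax check ran before or after other cleaning steps is not stated in the version history.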
CUKR, Michal. Český korpus příkladových vět [online]. Brno, 2017 [cit. 2019-04-18]. Available from:
[Chart: Czech SKELL Corpus, distribution of text sources]