Czech Corpus for SKELL | Sketch Engine

csSKELL: Czech Corpus for SKELL

The Czech Corpus for SKELL is a text database used by the Czech SKELL interface (csSKELL). The corpus does not contain whole documents but only sentences sorted according to their text quality.

In terms of corpus search, this approach means:

consecutive sentences are unrelated: a sentence has no connection to the one before or after it
sentences in the first concordance lines should be of higher quality than later ones, with fewer non-alphabetic characters and punctuation marks, more frequent words, etc.

The text quality score was computed by the GDEX (Good Dictionary Examples) system.
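The ranking idea described above can be illustrated with a small sketch. This is not the GDEX implementation: the function names, the toy frequency table, the threshold and the equal weighting are all assumptions chosen only to show how a score built from "share of alphabetic characters" and "share of frequent words" could order sentences.

```python
import re

# Hypothetical toy word frequencies; a real system would use corpus-wide counts.
TOY_FREQ = {"pes": 500, "je": 9000, "doma": 800, "a": 20000}


def toy_quality_score(sentence, freq=TOY_FREQ, common_threshold=100):
    """Return a score in [0, 1]; higher means a cleaner, more typical sentence.

    Illustrative only: the two features and their 50/50 weighting are assumptions.
    """
    chars = [c for c in sentence if not c.isspace()]
    if not chars:
        return 0.0
    # Share of non-space characters that are letters
    # (penalises digits, symbols and punctuation).
    alpha_share = sum(c.isalpha() for c in chars) / len(chars)

    # Unicode-aware word extraction, so Czech diacritics are kept.
    words = re.findall(r"[^\W\d_]+", sentence.lower())
    if not words:
        return 0.0
    # Share of words that count as frequent in the toy table.
    common_share = sum(freq.get(w, 0) >= common_threshold for w in words) / len(words)

    return 0.5 * alpha_share + 0.5 * common_share


def rank_sentences(sentences):
    """Sort sentences so that the highest-scoring ones come first."""
    return sorted(sentences, key=toy_quality_score, reverse=True)
```

With this sketch, `rank_sentences(["@@@ 123 ### pes !!!", "Pes je doma."])` places "Pes je doma." first, because it consists almost entirely of letters and frequent words.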

The corpus draws on several sources. The first is websites included by the Czech Webarchiv in its selective harvests. The second is 1,800 crawled websites provided by Webarchiv. The remaining sources are articles and talk pages from Czech Wikipedia (downloaded in April 2017) and texts from the .cz domain of the Czech Timestamped web corpus.

The variety of domains covered by the corpus lets users explore Czech in its everyday usage, in a collection of 1.4 billion words in more than 90 million sentences.

What is SKELL?

SKELL (Sketch Engine for Language Learning) is a simple tool for students and teachers of language to easily check whether or how a particular phrase or a word is used by real speakers of a language.

No registration or payment required. Just type a word and click a button.

All examples, collocations and synonyms were identified automatically by ingenious algorithms and state-of-the-art software analysing large multi-billion samples of text. No manual work was involved.

csSKELL is a Czech version of the SKELL tool based on Sketch Engine.

Statistics

Source	No. of words	Percentage
Webarchiv: selective harvests	~ 987,299,101	68.40 %
Webarchiv: other sources	~ 232,047,827	16.07 %
Timestamped web corpus	~ 133,488,941	9.24 %
Wikipedia including talk pages	~ 90,575,062	6.27 %
Total	1,443,410,941	100.00 %
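The percentage column can be recomputed from the word counts. The sketch below copies the figures from the table; the variable and function names are illustrative. Because the per-source counts are marked as approximate, they need not add up exactly to the published total, and recomputed shares may differ from the published ones in the last digit depending on rounding.

```python
# Word counts per source, copied from the table above (marked approximate there).
counts = {
    "Webarchiv: selective harvests": 987_299_101,
    "Webarchiv: other sources": 232_047_827,
    "Timestamped web corpus": 133_488_941,
    "Wikipedia including talk pages": 90_575_062,
}

# Total corpus size as published in the table.
total = 1_443_410_941


def shares(counts, total):
    """Return each source's share of the total, in percent."""
    return {source: 100 * n / total for source, n in counts.items()}


if __name__ == "__main__":
    for source, pct in shares(counts, total).items():
        print(f"{source}: {pct:.2f} %")
```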

Availability

The corpus is accessible to all users with a subscription plan and to site licence members, but not to trial users.

Tools to work with Czech SKELL corpus

A complete set of Sketch Engine tools is available for working with this Czech SKELL corpus and generating:

word sketch – Czech collocations categorized by grammatical relations
thesaurus – synonyms and similar words for every word
keywords – terminology extraction of one-word and multi-word units
word lists – lists of Czech nouns, verbs, adjectives etc. organized by frequency
n-grams – frequency lists of multi-word units
concordance – examples in context
text type analysis – statistics of metadata in the corpus

Changelog

Version	Description	Corpus size (words)
1.0	initial version without any cleaning	1,717,516,129
2.0	first published version, with simple cleaning: Slovak texts, texts without diacritics and headlines removed	1,608,867,697
2.1	further cleaning of Slovak texts and texts without diacritics; removed sentences containing: texts created automatically by Wikipedia; non-ASCII characters; only non-alphabetical characters; HTML tags; URLs and email addresses	1,552,052,945
2.2 (current version)	further cleaning of texts without diacritics; removed most sentences with GDEX value “0”; removed sentences starting with n-grams (Václav MORAVEC , | Moderátor – Václav Moravec); removed sentences not starting/ending with tag	1,443,410,941
2.3 (in preparation)	further cleaning of texts without diacritics; removal of sentences containing a hapax legomenon (a word with only one occurrence in the whole corpus)	–
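Some of the cleaning steps named in the changelog can be sketched as sentence-level filters. The code below is only an illustration under stated assumptions: the regular expressions, function names and tokenisation are hypothetical and are not taken from the actual cleaning pipeline, which this page does not describe in detail.

```python
import re
from collections import Counter

# Hypothetical patterns; the real pipeline may detect these differently.
HTML_TAG = re.compile(r"<[^>]+>")
URL = re.compile(r"(https?://|www\.)\S+", re.IGNORECASE)
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(\.[\w-]+)+\b")


def keep_sentence(sentence):
    """Return False for sentences with HTML tags, URLs, email addresses,
    or no alphabetical characters at all."""
    if HTML_TAG.search(sentence) or URL.search(sentence) or EMAIL.search(sentence):
        return False
    return any(c.isalpha() for c in sentence)


def drop_hapax_sentences(sentences):
    """Remove sentences containing a word that occurs only once
    in the whole (toy) collection."""
    # Unicode-aware word extraction, so Czech diacritics are kept.
    tokenised = [re.findall(r"[^\W\d_]+", s.lower()) for s in sentences]
    freq = Counter(w for words in tokenised for w in words)
    # Sentences without any extracted words pass this check unchanged.
    return [
        s for s, words in zip(sentences, tokenised)
        if all(freq[w] > 1 for w in words)
    ]
```

For example, `keep_sentence("Viz http://example.cz")` is rejected by the URL filter, and in the toy collection `["pes je doma", "pes je venku", "je doma pes"]` the second sentence is dropped because "venku" occurs only once.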

Bibliography

CUKR, Michal. Český korpus příkladových vět [online]. Brno, 2017 [cit. 2019-04-18]. Available from: . Master’s thesis. Masaryk University, Faculty of Arts. Thesis supervisor Vít Baisa.

Figure: Czech SKELL Corpus, distribution of text sources. Webarchiv: selective harvests (68.40 %); Webarchiv: other sources (16.07 %); Timestamped web corpus (9.24 %); Wikipedia including talk pages (6.27 %).

Other text corpora in Sketch Engine

Sketch Engine offers 800+ language corpora.

Use Sketch Engine in minutes

Generating collocations, frequency lists, examples in contexts, n-grams or extracting terms is easy with Sketch Engine. Use our Quick Start Guide to learn it in minutes.
