CZES is a Czech corpus consisting of newspaper and magazine articles from the years 1995–1998 and 2002.

  • The data was downloaded from trafika.cz and newspapers’ home sites: Lidové noviny, Mladá fronta, Českomoravský profit, Právo and others.
  • Some data (articles, books) was taken from many small websites (students’ work).
  • Another part was obtained from CD archives of PC magazines.
  • Some parts taken from newspapers’ home sites were added around 2002 (students’ work).

Tagging

CZES was tagged using Ajka tags.

Changelog

v2.0 (26 October 2010)

  • removed duplicate and near-duplicate documents

v2.1 (2015)

  • retokenised and retagged