Riznica: Croatian text corpus
The Riznica Croatian Language Corpus (CLC) is a text corpus compiled from a variety of sources, including online articles, printed books, and transcripts of recordings. Corpus searches can be restricted to a specific year, author, or title.
In detail, the text data include, among others:
- Online newspapers, books, articles
- Published books and other printed material
- Digital files of printed books made available by publishers (e.g. Školska knjiga, and the Croatian Academy of Arts and Sciences)
- Transcriptions of collected data and recordings
- Various other resources whose documents are available online
The Riznica corpus was annotated with the ReLDI tool, developed by Nikola Ljubešić. The tagger uses the following POS tagset from the MULTEXT-East project for Croatian.
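MULTEXT-East tags are positional: the first character gives the part of speech, and each later character encodes one attribute of that category. As a minimal sketch of how such a tag can be read, the Python snippet below decodes the category letter and, for nouns only, the attribute positions. The names `CATEGORIES`, `NOUN_ATTRIBUTES`, and `decode_msd` are hypothetical and are not part of the ReLDI tool or of Riznica; the mappings are an abbreviated illustration, and the MULTEXT-East specification remains the authoritative reference.

```python
# Illustrative, partial decoder for MULTEXT-East-style positional tags.
# All names here are hypothetical; the tables are deliberately abbreviated.

# First character of a tag: the part-of-speech category.
CATEGORIES = {
    "N": "Noun", "V": "Verb", "A": "Adjective", "P": "Pronoun",
    "R": "Adverb", "S": "Adposition", "C": "Conjunction", "M": "Numeral",
    "Q": "Particle", "I": "Interjection", "Y": "Abbreviation", "X": "Residual",
}

# Attribute positions for nouns, in the order they appear after the "N".
NOUN_ATTRIBUTES = [
    ("Type", {"c": "common", "p": "proper"}),
    ("Gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("Number", {"s": "singular", "p": "plural"}),
    ("Case", {"n": "nominative", "g": "genitive", "d": "dative",
              "a": "accusative", "v": "vocative", "l": "locative",
              "i": "instrumental"}),
    ("Animate", {"n": "no", "y": "yes"}),
]


def decode_msd(tag):
    """Return a dict describing a tag; only noun attributes are expanded."""
    if not tag:
        raise ValueError("empty tag")
    result = {"Category": CATEGORIES.get(tag[0], "unknown")}
    if tag[0] == "N":
        # Pair each remaining character with its attribute position.
        for (name, values), code in zip(NOUN_ATTRIBUTES, tag[1:]):
            if code != "-":  # "-" marks an attribute that does not apply
                result[name] = values.get(code, code)
    return result


print(decode_msd("Ncmsn"))
```

For the tag `Ncmsn`, this prints `{'Category': 'Noun', 'Type': 'common', 'Gender': 'masculine', 'Number': 'singular', 'Case': 'nominative'}`. Tags of other categories are reduced to their category name only, since their attribute tables are left out of this sketch.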