MATAS: the Morphologically Annotated Lithuanian Corpus
The Morphologically Annotated Lithuanian Corpus (MATAS) is a language corpus made up of different text genres. It was compiled and prepared by the Center of Computational Linguistics (CCL) at Vytautas Magnus University. The corpus consists of 739,176 words with manual annotation that indicates detailed grammatical categories. The texts are extracted from the Corpus of the Contemporary Lithuanian Language at CCL (a 100-million-word corpus).
For more information see https://clarin.vdu.lt/xmlui/handle/20.500.11821/9?show=full
Part-of-speech tagset
The MATAS corpus is manually annotated at the morphological level with the following POS tagset.
Access policy
Access to the corpus is limited to academic use. To gain access, send an email to support@sketchengine.co.uk with proof of your academic affiliation.
Distribution of text genres
Tools to work with the Morphologically Annotated Lithuanian Corpus
A complete set of Sketch Engine tools is available to work with this Lithuanian corpus to generate:
- word lists – lists of Lithuanian nouns, verbs, adjectives etc. organized by frequency
- n-grams – frequency lists of multi-word units
- concordance – examples in context
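The three outputs listed above can be illustrated with a minimal Python sketch. This is not how Sketch Engine computes them and it does not use any Sketch Engine API; the sample sentence, the whitespace tokenizer, and the function names `tokenize`, `word_list`, `ngrams`, and `concordance` are all assumptions made for illustration only.

```python
from collections import Counter

# Toy Lithuanian sample; hypothetical text, not taken from MATAS.
SAMPLE = "aš skaitau knygą ir tu skaitai knygą"


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def word_list(tokens):
    """Return (word, count) pairs sorted by descending frequency."""
    return Counter(tokens).most_common()


def ngrams(tokens, n):
    """Count contiguous sequences of n tokens."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


def concordance(tokens, keyword, window=2):
    """Return keyword-in-context lines with up to `window` tokens per side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


tokens = tokenize(SAMPLE)
print(word_list(tokens)[:3])            # most frequent words
print(ngrams(tokens, 2).most_common(2)) # most frequent bigrams
print(concordance(tokens, "knygą"))     # keyword shown in context
```

A real corpus query would also use the lemma and morphological tags that MATAS provides, which this sketch ignores by counting surface forms only.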
Bibliography
Rimkutė, E. (2014). Lithuanian morphologically annotated corpus-MATAS, CLARIN-LT digital library in the Republic of Lithuania, http://hdl.handle.net/20.500.11821/9.