Maltese Reference Corpus
The Maltese Reference Corpus is a Maltese corpus made up of texts collected from the Internet.
The corpus consists of 460 million words.
Part-of-speech tagset and lemmatization
The corpus is tagged with a Maltese part-of-speech tagset and is lemmatized.
Maltese Reference Corpus sizes
| Measure | Count |
| --- | --- |
| Number of words | 460+ million |
| Number of tokens | 530+ million |
| Number of sentences | 18+ million |
| Number of web pages | 460+ thousand |
Search the Maltese Reference Corpus
Sketch Engine offers a range of tools to work with this Maltese corpus.
Tools to work with the Maltese Reference Corpus from the web
A complete set of Sketch Engine tools is available for this Maltese corpus. The tools generate:
- word sketch – Maltese collocations categorized by grammatical relations
- thesaurus – synonyms and similar words for every word
- keywords – terminology extraction of one-word units
- word lists – lists of Maltese nouns, verbs, adjectives, etc., organized by frequency
- n-grams – frequency list of multi-word units
- concordance – examples in context
- text type analysis – statistics of metadata in the corpus
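To make three of the outputs listed above concrete (word lists, n-grams and concordance), here is a minimal Python sketch. It is an illustration only, not how Sketch Engine computes these results: the function names `ngrams`, `frequency_list` and `concordance` are hypothetical, the input is a toy English sentence rather than text from the corpus, and tokenization is assumed to be a plain whitespace split.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def frequency_list(items):
    """Return (item, count) pairs sorted by descending count, then by item."""
    counts = Counter(items)
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))


def concordance(tokens, keyword, window=2):
    """Return (left context, keyword, right context) for every hit."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, token, right))
    return lines


# Toy sample (an assumption for illustration, not taken from the corpus).
sample = "the cat sat on the mat and the cat slept".split()

word_list = frequency_list(sample)               # word list by frequency
bigram_list = frequency_list(ngrams(sample, 2))  # 2-gram frequency list
hits = concordance(sample, "cat", window=2)      # keyword in context

for left, keyword, right in hits:
    print(f"{left:>10} [{keyword}] {right}")
```

On a real corpus the same ideas apply to hundreds of millions of tokens, and the tokens also carry part-of-speech tags and lemmas, which this sketch ignores.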
Changelog
Maltese Reference Corpus
An all-purpose Maltese corpus covering the largest possible variety of genres, topics, text types and web sources. Recommended for both general use and specialized language. The Maltese Reference Corpus is a compilation of the following sources: the corpus Malti v. 4.2 (a general corpus of Maltese from 2024); the web corpus MaCoCu Maltese 2021 v. 2; Maltese Trends (web pages published in newsfeeds) from 2023-08 to 2025-12; Maltese Wikipedia from March 2025; and Maltese web pages downloaded from September to October 2021.
Bibliography
TenTen corpora
SUCHOMEL, Vít. Better Web Corpora For Corpus Linguistics And NLP. 2020. Available also from: https://is.muni.cz/th/u4rmz/. Doctoral thesis. Masaryk University, Faculty of Informatics, Brno. Supervised by Pavel RYCHLÝ.
Jakubíček, M., Kilgarriff, A., Kovář, V., Rychlý, P., & Suchomel, V. (2013, July). The TenTen corpus family. In 7th International Corpus Linguistics Conference CL (pp. 125-127).
Suchomel, V., & Pomikálek, J. (2012). Efficient web crawling for large text corpora. In Proceedings of the seventh Web as Corpus Workshop (WAC7) (pp. 39-43).
Genre annotation
SUCHOMEL, Vít. Genre Annotation of Web Corpora: Scheme and Issues. In Kohei Arai, Supriya Kapoor, Rahul Bhatia. Proceedings of the Future Technologies Conference (FTC) 2020, Volume 1. Vancouver, Canada: Springer Nature Switzerland AG, 2021. pp. 738-754. ISBN 978-3-030-63127-7. doi:10.1007/978-3-030-63128-4_55.