Maltese Reference Corpus

The Maltese Reference Corpus is a large Maltese corpus built from a wide range of sources. It contains more than 460 million words and combines the following resources:

Korpus Malti v4.2 (2024)

A Maltese corpus comprising both online and offline sources. It accounts for 66.6% of the Maltese Reference Corpus (353 million words). See the Korpus Malti v4.2 page for more details.

Maltese Web 2021 (mtTenTen21)

Part of the TenTen corpus family, this corpus is built using technology designed to collect linguistically valuable web content. It represents 17.9% of the corpus (95 million words).

MaCoCu Maltese Web v2 (2021)

A Maltese web corpus comprising online and offline sources. It contributes 11.7% of the Maltese Reference Corpus (62 million words). See more information on the MaCoCu corpora page.

Maltese Trends (2023–2025)

A monitor corpus consisting of news articles and other regularly updated sources collected via RSS feeds. It accounts for 2.5% of this Maltese corpus (13 million words). For more details, visit the Maltese Trends corpus page.

Maltese Wikipedia (as of March 2025)

A corpus based on Maltese Wikipedia, contributing 5.8 million words (1.1%).
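
The quoted shares can be checked with simple arithmetic. The sketch below uses only the rounded figures quoted in the descriptions above and recomputes each share against the sum of the components. The components add up to roughly 529 million, which is closer to the 530+ million tokens than to the 460+ million words reported for the whole corpus, so the per-component figures may in fact be token counts. That reading is an inference from the numbers, not something the source states.

```python
# Component sizes (millions) and quoted shares (%), copied from the descriptions above.
components = {
    "Korpus Malti v4.2": (353.0, 66.6),
    "Maltese Web 2021 (mtTenTen21)": (95.0, 17.9),
    "MaCoCu Maltese Web v2": (62.0, 11.7),
    "Maltese Trends": (13.0, 2.5),
    "Maltese Wikipedia": (5.8, 1.1),
}

# Sum of the rounded component sizes, in millions.
total = sum(size for size, _ in components.values())


def share(size, whole):
    """Return size as a percentage of whole."""
    return 100.0 * size / whole


# Recomputed share of each component, relative to the sum of all components.
recomputed = {name: share(size, total) for name, (size, _) in components.items()}

for name, (size, quoted) in components.items():
    print(f"{name}: quoted {quoted:.1f}%, recomputed {recomputed[name]:.1f}%")
print(f"Sum of components: {total:.1f} million")
```

Because the inputs are rounded, the recomputed shares differ from the quoted ones by a fraction of a percentage point.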

Part-of-speech tagset and lemmatization

The corpus is part-of-speech tagged by Maltese BERTu using the Maltese part-of-speech tagset, which indicates the part of speech and grammatical category of each token. The corpus is also lemmatized: each word form is assigned its base form (lemma) by the cstlemma lemmatizer trained on the Gabra lexicon.

Maltese Reference Corpus – corpus sizes

Number of words: 460+ million
Number of tokens: 530+ million
Number of sentences: 18+ million
Number of web pages: 460+ thousand
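
The relationship between these counts can be worked out from the exact totals published for the corpus (530,879,049 tokens and 460,079,532 words). The sketch below is only arithmetic on those figures. The interpretation of the gap as punctuation and other non-word tokens is an assumption about how words and tokens are counted, and the sentence figure is a lower bound ("18+ million"), so the average sentence length is an upper estimate.

```python
# Exact totals as published for the Maltese Reference Corpus.
tokens = 530_879_049
words = 460_079_532

# Lower bound on the number of sentences ("18+ million").
min_sentences = 18_000_000

# Tokens that are not counted as words (assumed: punctuation and similar).
non_word_tokens = tokens - words

# Proportion of tokens that are words.
word_ratio = words / tokens

# Upper estimate of average sentence length, since the sentence count is a lower bound.
max_tokens_per_sentence = tokens / min_sentences

print(f"Non-word tokens: {non_word_tokens:,}")
print(f"Words per token: {word_ratio:.3f}")
print(f"Average tokens per sentence (at most): {max_tokens_per_sentence:.1f}")
```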

Genre and topic classification

Part of the Maltese Reference Corpus carries topic annotation (22% of the whole corpus) and genre annotation (82% of the whole corpus).

Search the Maltese Reference Corpus

Sketch Engine offers a range of tools to work with this Maltese corpus.

Tools to work with the Maltese Reference Corpus

The full set of Sketch Engine tools is available for this Maltese corpus. They can be used to generate:

  • word sketch – Maltese collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • keywords – terminology extraction of one-word units
  • word lists – lists of Maltese nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
  • text type analysis – statistics of metadata in the corpus
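
To make the outputs listed above more concrete, here is a toy illustration of three of them: a frequency-ordered word list, n-gram counts, and a keyword-in-context concordance. It is a minimal sketch in plain Python over a made-up English sentence. The function names `ngrams` and `concordance` are hypothetical helpers written for this example; this is not the Sketch Engine API and not how Sketch Engine computes these results.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def concordance(tokens, keyword, window=3):
    """Return (left context, keyword, right context) for each hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines


# Placeholder text; a real corpus would be tokenized, tagged and lemmatized first.
tokens = "the cat sat on the mat and the cat slept".split()

# Word list: word forms ordered by frequency.
word_list = Counter(tokens).most_common()

# N-grams: frequency of two-word sequences.
bigram_counts = Counter(ngrams(tokens, 2))

# Concordance: every occurrence of "cat" with two tokens of context on each side.
kwic = concordance(tokens, "cat", window=2)

for left, kw, right in kwic:
    print(f"{left:>10} [{kw}] {right}")
```

On a real corpus the same ideas operate on annotated tokens, so that queries can also target lemmas or part-of-speech tags rather than raw word forms.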

Maltese Reference Corpus

  • published in February 2026
  • tokens: 530,879,049
  • words: 460,079,532
  • consists of:
    • Korpus Malti v4.2 (2024)
    • Maltese Web 2021 (mtTenTen21)
    • MaCoCu Maltese Web v2 (2021)
    • Maltese Trends (2023–2025)
    • Maltese Wikipedia (as of March 2025)

Korpus Malti v4.2

Micallef, K., Gatt, A., Tanti, M., van der Plas, L., & Borg, C. (2022). Pre-training Data Quality and Quantity for a Low-Resource Language: New Corpus and BERT Models for Maltese. In Proceedings of the Third Workshop on Deep Learning for Low-Resource Natural Language Processing (pp. 90–101). Association for Computational Linguistics.

Maltese Web 2021 (mtTenTen21) – TenTen corpora

Suchomel, V. (2020). Better web corpora for corpus linguistics and NLP (Doctoral thesis, Masaryk University, Faculty of Informatics). https://is.muni.cz/th/u4rmz/

Jakubíček, M., Kilgarriff, A., Kovář, V., Rychlý, P., & Suchomel, V. (2013). The TenTen corpus family. In Proceedings of the 7th International Corpus Linguistics Conference (pp. 125–127).

Suchomel, V., & Pomikálek, J. (2012). Efficient web crawling for large text corpora. In Proceedings of the Seventh Web as Corpus Workshop (WAC7) (pp. 39–43).

MaCoCu Maltese Web v2 (2021) – MaCoCu corpora

Bañón, M., Esplà-Gomis, M., Forcada, M. L., García-Romero, C., Kuzman, T., Ljubešić, N., van Noord, R., Pla Sempere, L., Ramírez-Sánchez, G., Rupnik, P., Suchomel, V., Toral, A., van der Werff, T., & Zaragoza, J. (2022). MaCoCu: Massive collection and curation of monolingual and bilingual data: Focus on under-resourced languages. In Proceedings of the 23rd Annual Conference of the European Association for Machine Translation (pp. 303–304). European Association for Machine Translation.

Maltese Trends corpus – Trends corpora

Herman, O., Jakubíček, M., Kraus, J., & Suchomel, V. (2025). From word of the year to word of the week: Daily-updated monitor corpora for 25 languages. In Proceedings of the eLex 2025 conference (pp. 44–61).

Herman, O., Kraus, J., & Suchomel, V. (2026). FeedFetcher: A resilient web feed downloader for corpus construction. In Proceedings of the Fifteenth International Conference on Language Resources and Evaluation (LREC 2026). (Forthcoming)

Genre annotation

Note: the Maltese Reference Corpus uses a mixed classification of genres and topics that does not strictly follow the definition in the paper below.

Suchomel, V. (2021). Genre Annotation of Web Corpora: Scheme and Issues. In K. Arai, S. Kapoor, & R. Bhatia (Eds.), Proceedings of the Future Technologies Conference (FTC) 2020, Volume 1 (pp. 738–754). Springer. https://doi.org/10.1007/978-3-030-63128-4_55

Other Maltese corpora

Sketch Engine offers dozens of Maltese language corpora.
