Maltese Reference Corpus

The Maltese Reference Corpus is a Maltese corpus made up of texts collected from the Internet.

The corpus contains more than 460 million words.

Part-of-speech tagset and lemmatization

The corpus is tagged with a Maltese part-of-speech tagset and is also lemmatized.

Maltese Reference Corpus sizes

Number of words       460+ million
Number of tokens      530+ million
Number of sentences   18+ million
Number of web pages   460+ thousand
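
The size figures above can be combined into a couple of rough ratios. The sketch below uses the published lower bounds ("460+ million" and so on) as if they were exact, so the results are approximations only. The variable names are made up for this example, and reading the gap between tokens and words as mostly punctuation is an assumption, not something stated on this page.

```python
# Lower-bound figures copied from the size table; the real counts are somewhat higher.
words = 460_000_000
tokens = 530_000_000
sentences = 18_000_000

# Average sentence length in tokens (about 29).
tokens_per_sentence = tokens / sentences

# Share of tokens not counted as words, assumed here to be mostly punctuation (about 0.13).
non_word_share = (tokens - words) / tokens

print(f"tokens per sentence: {tokens_per_sentence:.1f}")
print(f"non-word share of tokens: {non_word_share:.2f}")
```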

Search the Maltese Reference Corpus

Sketch Engine offers a range of tools to work with this Maltese corpus.

Tools to work with the Maltese Reference Corpus from the web

The complete set of Sketch Engine tools can be used with this Maltese corpus to generate:

  • word sketch – Maltese collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • keywords – terminology extraction of one-word units
  • word lists – lists of Maltese nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
  • text type analysis – statistics of metadata in the corpus
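
To give a feel for what word lists, n-grams and concordances contain, here is a minimal Python sketch run on a tiny, hand-made token list. It is an illustration only: it is not how Sketch Engine computes these results, the sample tokens are not taken from the corpus, and the function names (word_list, ngram_counts, concordance) are hypothetical.

```python
from collections import Counter

# Toy, pre-tokenized sample (an assumption for illustration; not corpus data).
# Sentence boundaries and punctuation are ignored to keep the sketch short.
tokens = [
    "il-kelb", "jiekol", "il-ħobż",
    "il-qattus", "jixrob", "il-ħalib",
    "il-kelb", "jixrob", "il-ħalib",
]


def word_list(tokens):
    """Return (token, count) pairs ordered by descending frequency."""
    return Counter(tokens).most_common()


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def concordance(tokens, keyword, width=2):
    """Return keyword-in-context rows: (left context, keyword, right context)."""
    rows = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, tok, right))
    return rows


for left, kw, right in concordance(tokens, "jixrob", width=1):
    print(f"{left:>12} [{kw}] {right}")
```

A real corpus query also relies on the lemma and part-of-speech annotation, which this sketch leaves out.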

Maltese Reference Corpus

An all-purpose Maltese corpus covering the largest possible variety of genres, topics, text types and web sources. Recommended for both general use and specialized language. The Maltese Reference Corpus is a compilation of:

  • corpus Malti v. 4.2 (a general corpus of Maltese from 2024)
  • the web corpus MaCoCu Maltese 2021 v. 2
  • Maltese Trends (web pages published in newsfeeds) from 2023-08 to 2025-12
  • Maltese Wikipedia from March 2025
  • Maltese web pages downloaded from September to October 2021

Other Maltese corpora

Sketch Engine offers dozens of Maltese language corpora.

Use Sketch Engine in minutes

Generate collocations, frequency lists, examples in context and n-grams, or extract terms. Use our Quick Start Guide to learn how in minutes.