MaCoCu: Corpora from the Web
The MaCoCu corpora were built by crawling the national Internet top-level domain of each target language in 2021 and 2022, dynamically extending the crawl to other domains as well. The crawler is available at https://github.com/macocu/MaCoCu-crawler.
Considerable effort was devoted to cleaning the extracted texts to provide high-quality web corpora. This was achieved by removing boilerplate (https://corpus.tools/wiki/Justext) and near-duplicate paragraphs (https://corpus.tools/wiki/Onion), and by discarding very short texts as well as texts that are not in the target language. Furthermore, samples from the largest domains were manually checked, and bad domains, such as machine-translated domains, were removed. The dataset comes with extensive metadata, which allows filtering based on text quality and other criteria (https://github.com/bitextor/monotextor), making the corpora highly useful for corpus linguistics studies as well as for training language models and other language technologies.
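The cleaning steps described above can be illustrated with a minimal Python sketch. It is not the MaCoCu pipeline and does not call jusText, Onion or Monotextor. Boilerplate removal is left out entirely, near-duplicate detection is replaced by a simple word n-gram overlap heuristic, and the names MIN_WORDS, max_seen_ratio and is_target_language are hypothetical stand-ins chosen for this example.

```python
import re

MIN_WORDS = 5  # hypothetical threshold for "very short" texts


def normalize(paragraph):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", paragraph.strip().lower())


def shingles(text, n=3):
    """Return the set of word n-grams in a normalized paragraph."""
    words = text.split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def clean(documents, is_target_language, max_seen_ratio=0.5):
    """Drop near-duplicate paragraphs, very short texts and off-language texts.

    `is_target_language` is a caller-supplied function (an assumption of this
    sketch) that returns True when a text is in the wanted language.
    """
    seen = set()  # n-grams observed in earlier paragraphs
    kept = []
    for doc in documents:
        paragraphs = []
        for par in doc.split("\n"):
            norm = normalize(par)
            if not norm:
                continue
            grams = shingles(norm)
            seen_ratio = len(grams & seen) / len(grams)
            seen |= grams
            if seen_ratio > max_seen_ratio:
                continue  # mostly seen before: treat as a near-duplicate
            paragraphs.append(par.strip())
        text = "\n".join(paragraphs)
        if len(text.split()) < MIN_WORDS:
            continue  # too short to be useful
        if not is_target_language(text):
            continue  # wrong language
        kept.append(text)
    return kept
```

In practice the language check would be backed by a language identification tool, and the thresholds would need tuning against manually inspected samples.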
Thanks to the MaCoCu project, corpora in multiple languages are now available in Sketch Engine. If you want to find out more about this project and individual corpora, please refer to this website: https://macocu.eu/
Search the MaCoCu corpora
Sketch Engine offers a range of tools to work with these MaCoCu corpora.
Overview of MaCoCu corpora
The following MaCoCu corpora are available in Sketch Engine:
- MaCoCu Croatian Web v2 (2021–2022) – 2.3 billion words
- MaCoCu Bosnian Web v1 (2021–2022) – 715 million words
- MaCoCu Slovene Web v2 (2021–2022) – 1.8 billion words
- MaCoCu Ukrainian Web v1 (2021–2022) – 5.9 billion words
- MaCoCu Macedonian Web v2 (2021) – 512 million words
Tools to work with the MaCoCu corpora from the web
A complete set of Sketch Engine tools is available for these MaCoCu corpora. The tools generate:
- word sketch – collocations categorized by grammatical relations
- thesaurus – synonyms and similar words for every word
- keywords – terminology extraction of one-word and multi-word units
- word lists – lists of nouns, verbs, adjectives, etc., organized by frequency
- n-grams – frequency lists of multi-word units
- concordance – examples in context
- text type analysis – statistics of metadata in the corpus
Note: Some of the functions may not be available for some of the MaCoCu corpora.
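To make the idea of an n-gram frequency list concrete, here is a minimal Python sketch that counts multi-word units in a tokenized text. It only illustrates the concept and is not how Sketch Engine computes its lists. The whitespace tokenization and the function name ngram_frequencies are assumptions made for this example.

```python
from collections import Counter


def ngram_frequencies(tokens, n=2):
    """Count every run of n consecutive tokens in the token list."""
    # zip over n shifted copies of the list yields consecutive n-token tuples
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


# Naive whitespace tokenization of a toy sentence (an assumption of this sketch)
tokens = "the web corpus is a web corpus from the web".split()
counts = ngram_frequencies(tokens, n=2)

# Print the bigrams from most to least frequent
for ngram, freq in counts.most_common():
    print(freq, ngram)
```

A real corpus tool works on properly tokenized and often lemmatized text, so its counts will differ from what this whitespace-based sketch produces.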
- MaCoCu Bosnian Web v1 (2021–2022) (November 2023)
- MaCoCu Croatian Web v2 (2021–2022) (November 2023)
- MaCoCu Slovene Web v2 (2021–2022) (November 2023)
- MaCoCu Ukrainian Web v1 (2021–2022) (November 2023)
- MaCoCu Macedonian Web v2 (2021–2022) (November 2023)
Marta Bañón, Miquel Esplà-Gomis, Mikel L. Forcada, Cristian García-Romero, Taja Kuzman, Nikola Ljubešić, Rik van Noord, Leopoldo Pla Sempere, Gema Ramírez-Sánchez, Peter Rupnik, Vít Suchomel, Antonio Toral, Tobias van der Werff, and Jaume Zaragoza. 2022. MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages. In Proceedings of the 23rd Annual Conference of the European Association for Machine Translation, pages 303–304, Ghent, Belgium. European Association for Machine Translation.