CoPEP: Corpus of Portuguese from Academic Journals
The CoPEP Corpus (Corpus de Português Escrito em Periódicos) is a synchronic corpus of Portuguese made up of around 10,000 texts collected from academic journals from Brazil and Portugal. The corpus was prepared specifically for a lexicographic project focussed on designing an online corpus-driven dictionary of Portuguese for university students (Kuhn, 2017). It contains approximately 40 million words, distributed among three Schools of Knowledge and further divided into six Great Areas, following the CAPES classification.
The subcorpora for the two language varieties are almost the same size and contain a similar number of words in each Great Area and School, so the corpus is evenly balanced. Metadata on the texts, e.g. year of publication, Great Area of Knowledge and ISSN, have been carefully recorded to allow advanced corpus search options.
This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001 and in part by the Fundação para a Ciência e a Tecnologia de Portugal, through the Strategic Project of CELGA-ILTEC at the University of Coimbra (POCI-01-0145-FEDER-006986 – UID/LIN/04887/2013).
Authors of corpus
The Corpus of Portuguese from Academic Journals was created by Tanara Zingano Kuhn and José Pedro Ferreira in 2018. For more information about the corpus, please contact Tanara Zingano Kuhn at
Copyright & financing
Texts in the corpus are provided under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International licence (CC BY-NC-SA 4.0).
The Portuguese CoPEP corpus was tagged by FreeLing using EAGLES PoS tags.
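In EAGLES-style tagsets such as those used by FreeLing, the first character of a tag encodes the coarse word class and the remaining characters encode finer features. The following is a minimal Python sketch of how such tags might be reduced to a word class when working with exported data. The mapping follows the general FreeLing convention and is assumed, not confirmed, to match the CoPEP tagset exactly. The function name `coarse_category` and the example tags are illustrative only and are not taken from the corpus.

```python
# Coarse word classes keyed by the first character of an EAGLES-style tag.
# Assumption: CoPEP follows the general FreeLing convention for this position.
EAGLES_CATEGORIES = {
    "A": "adjective",
    "C": "conjunction",
    "D": "determiner",
    "F": "punctuation",
    "I": "interjection",
    "N": "noun",
    "P": "pronoun",
    "R": "adverb",
    "S": "adposition",
    "V": "verb",
    "W": "date",
    "Z": "number",
}


def coarse_category(tag: str) -> str:
    """Return the coarse word class for an EAGLES-style tag, or 'unknown'."""
    if not tag:
        return "unknown"
    return EAGLES_CATEGORIES.get(tag[0].upper(), "unknown")


# Illustrative tags, not drawn from the corpus.
for example in ("NCMS000", "VMIP3S0", "AQ0CS0"):
    print(example, "->", coarse_category(example))
```

A lookup on the first character is enough for filtering by word class. Decoding the remaining positions (gender, number, tense and so on) depends on the word class and should be checked against the tagset documentation.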