CoPEP – academic Portuguese corpus

CoPEP: Corpus of Portuguese from Academic Journals

The CoPEP Corpus (Corpus de Português Escrito em Periódicos) is a synchronic corpus of Portuguese made up of around 10,000 texts collected from academic journals in Brazil and Portugal. The corpus was prepared specifically for a lexicographic project focussed on designing an online corpus-driven dictionary of Portuguese for university students (Kuhn, 2017). It contains approximately 40 million words, distributed among three Schools of Knowledge and further divided into six Great Areas (according to the CAPES classification).

The subcorpora for the two language varieties are of almost the same size and contain a similar number of words in each Great Area and School, making the corpus evenly balanced. Metadata on the texts, e.g. year of publication, Great Area of Knowledge and ISSN, have been carefully recorded to allow advanced corpus search options.

This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001 and in part by the Fundação para a Ciência e a Tecnologia de Portugal, through the Strategic Project of CELGA-ILTEC at the University of Coimbra (POCI-01-0145-FEDER-006986 – UID/LIN/04887/2013).

Authors of corpus

The Corpus of Portuguese from Academic Journals was created by Tanara Zingano Kuhn and José Pedro Ferreira in 2018. For more information about the corpus, please contact Tanara Zingano Kuhn at tanarazingano(a)outlook.com

Copyright & financing

Texts in the corpus are provided under Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).


Part-of-speech tagset

The Portuguese CoPEP corpus was tagged by FreeLing using EAGLES PoS tags.
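As a rough illustration of how EAGLES-style tags can be read, the sketch below maps the first character of a tag to a coarse part-of-speech category. The mapping follows the usual FreeLing convention as far as we are aware; the function name `coarse_category` and the sample tags are hypothetical and are not taken from the CoPEP documentation, so the tagset description should be checked before relying on it.

```python
# Assumed mapping from the first character of an EAGLES-style tag
# to a coarse part-of-speech category (verify against the tagset docs).
EAGLES_CATEGORIES = {
    "A": "adjective",
    "C": "conjunction",
    "D": "determiner",
    "F": "punctuation",
    "I": "interjection",
    "N": "noun",
    "P": "pronoun",
    "R": "adverb",
    "S": "adposition",
    "V": "verb",
    "W": "date",
    "Z": "number",
}


def coarse_category(tag: str) -> str:
    """Return the coarse category for an EAGLES-style tag, or 'unknown'."""
    if not tag:
        return "unknown"
    return EAGLES_CATEGORIES.get(tag[0].upper(), "unknown")


# Hypothetical sample tags, for illustration only.
print(coarse_category("NCMS000"))
print(coarse_category("VMIP3S0"))
```

Only the first position is decoded here; the remaining positions of a tag encode finer features such as gender, number or tense, which this sketch deliberately ignores.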

Corpus metadata - categories and names of attributes

Texts are distributed into three Schools of Knowledge, and further divided into six Great Areas (according to CAPES classification).

Colégio	Grandes áreas
Colégio de Humanidades (HU)	Ciências Humanas (Hu), Ciências Sociais Aplicadas (Ap)
Colégio de Ciências da Vida (CV)	Ciências da Saúde (He), Ciências Agrícolas (Ag)
Colégio de Ciências Exatas, da Terra e Multidisciplinar (CE)	Engenharia (En), Ciências Exatas e da Terra (Ex)
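The two-level classification above can be captured in a small lookup structure. The following is a minimal sketch that uses only the abbreviations given in the table (HU, CV, CE and Hu, Ap, He, Ag, En, Ex); the names `SCHOOL_TO_GREAT_AREAS` and `school_of` are hypothetical helpers, not part of the corpus or of any tool.

```python
# School code -> Great Area codes, as read from the table above.
SCHOOL_TO_GREAT_AREAS = {
    "HU": ("Hu", "Ap"),  # Humanidades
    "CV": ("He", "Ag"),  # Ciências da Vida
    "CE": ("En", "Ex"),  # Ciências Exatas, da Terra e Multidisciplinar
}


def school_of(great_area_code: str) -> str:
    """Return the School code that a Great Area code belongs to."""
    for school, areas in SCHOOL_TO_GREAT_AREAS.items():
        if great_area_code in areas:
            return school
    raise KeyError(f"unknown Great Area code: {great_area_code!r}")


print(school_of("He"))
```

A lookup like this can be handy when aggregating per-area counts up to the School level.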

Names of attributes

Attributes in English	Attributes in Portuguese
variety	variedade
source	fonte
school	colegio
great_area	grande_area
issn	issn (no change)
year	ano
issue	num_edicao
article_num	num_artigo
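When metadata exported with English attribute names needs to be matched against the Portuguese names (or the reverse), the table above can be applied as a simple key translation. This is a sketch under that assumption; `to_portuguese_keys` is a hypothetical helper and the sample record is invented for illustration.

```python
# English -> Portuguese attribute names, copied from the table above.
ATTRIBUTE_NAMES_EN_TO_PT = {
    "variety": "variedade",
    "source": "fonte",
    "school": "colegio",
    "great_area": "grande_area",
    "issn": "issn",
    "year": "ano",
    "issue": "num_edicao",
    "article_num": "num_artigo",
}


def to_portuguese_keys(metadata: dict) -> dict:
    """Rename known English attribute names; leave unknown keys untouched."""
    return {ATTRIBUTE_NAMES_EN_TO_PT.get(key, key): value
            for key, value in metadata.items()}


# Invented sample record (placeholder values, not real corpus data).
record = {"year": "2015", "issn": "0000-0000", "great_area": "Engineering"}
print(to_portuguese_keys(record))
```

Only the keys are renamed; the values are passed through unchanged.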

Names of attribute values

Attributes	Values in English	Values in Portuguese
great_area	Exact-Earth Sciences	Exatas e da Terra
great_area	Engineering	Engenharia
school	Ex-Tech-Multi Sciences	Ciencias da Terra, Exatas e Multidisciplinar
great_area	Health Sciences	Ciencias da Saude
great_area	Agricultural Sciences	Ciencias Agricolas
school	Life Sciences	Ciencias da Vida
great_area	Applied Social Sciences	Ciencias Socias Aplicadas
great_area	Human Sciences	Ciencias Humanas
school	Humanities	Humanidades
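Attribute values such as these are typically used to restrict a search to part of the corpus. The sketch below builds a CQL string with a `within` clause from an attribute and a value. It assumes that the metadata are attached to a structure called `doc`, which is not stated in this description and should be checked in the corpus information; `cql_within` is a hypothetical helper.

```python
def cql_within(token_query: str, attribute: str, value: str,
               structure: str = "doc") -> str:
    """Build a CQL query restricted to structures with attribute=value.

    The default structure name "doc" is an assumption, not a documented
    property of the CoPEP corpus.
    """
    escaped = value.replace('"', '\\"')  # keep the quoted value well-formed
    return f'{token_query} within <{structure} {attribute}="{escaped}" />'


print(cql_within('[lemma="análise"]', "great_area", "Engineering"))
```

CQL treats attribute values as regular expressions, so values containing special characters may need additional escaping.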

Tools to work with the CoPEP corpus

A complete set of tools is available to work with this academic Portuguese corpus to generate:

word sketch – Portuguese collocations categorized by grammatical relations
thesaurus – synonyms and similar words for every word
word lists – lists of Portuguese nouns, verbs, adjectives etc. organized by frequency
n-grams – frequency list of multi-word units
concordance – examples in context
keywords – terminology extraction of one-word units
text type analysis – statistics of metadata in the corpus
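To make the idea behind one of these tools concrete, an n-gram frequency list can be sketched in a few lines. This is a generic illustration on an invented token list, not a description of how Sketch Engine computes n-grams; `ngram_frequencies` is a hypothetical name.

```python
from collections import Counter


def ngram_frequencies(tokens: list, n: int = 2) -> Counter:
    """Count contiguous n-grams (joined by spaces) in a token list."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # zip over n shifted copies of the list yields every contiguous n-gram.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


# Invented example tokens.
tokens = "o corpus de português o corpus".split()
for gram, count in ngram_frequencies(tokens, 2).most_common(2):
    print(gram, count)
```

A real frequency list would normally also apply tokenisation, case handling and minimum-frequency thresholds, which are left out here.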

Bibliography & how to cite this corpus

How to cite

Tanara Zingano Kuhn & José Pedro Ferreira (2018). CoPEP – Corpus de Português Escrito em Periódicos (v.1.4)

Bibliography

Kuhn, Tanara Zingano; Ferreira, José Pedro (2018). Introducing CoPEP, the Corpus de Português Escrito em Periódicos (Corpus of Portuguese from Academic Journals). In: 14th American Association for Corpus Linguistics (AACL) Conference, p. 61.
Kuhn, Tanara Zingano (2017). A design proposal of an online corpus-driven dictionary of Portuguese for university students. Tese de Doutoramento em Linguística Aplicada. Lisboa: Universidade de Lisboa.

Search the CoPEP corpus

Sketch Engine offers a range of tools to work with this corpus of Portuguese from academic journals.


Other text corpora

Sketch Engine offers 800+ language corpora.


Use Sketch Engine in minutes

Generate collocations, frequency lists, examples in contexts, n-grams or extract terms. Use our Quick Start Guide to learn it in minutes.

