PICAE: Pearson International Corpus of Academic English
The Pearson International Corpus of Academic English (PICAE) is a language corpus made up of texts collected from the Internet. PICAE comprises over 37 million words, of which 13% is spoken and 87% is written material, covering American, Australian, British, Canadian and New Zealand English. Corpus texts span a wide range of academic subjects across the four main academic disciplines, namely humanities, social science, natural & formal science and professions & applied sciences. The corpus also includes lectures, seminars, textbooks and journal articles at undergraduate as well as postgraduate levels, university administrative material, university magazines, TV and radio broadcasts, etc.
Data for the PICAE corpus was gathered from five different sources:
- 19.6 million words from the World Wide Web
- 12.1 million words from the Longman Higher Education textbooks
- 0.7 million words from the Longman Spoken American Corpus
- 4.4 million words from the British National Corpus
- 0.4 million words of academic English from the American National Corpus
Material was also taken from the academic sections of the British National Corpus, comprising 56 articles from 13 different academic disciplines (e.g., literature, art, chemistry) published between 1975 and 1993.
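The word counts listed for the five sources can be cross-checked against the stated total of over 37 million words. The following is a minimal Python sketch of that arithmetic; the figures are copied from the list above, while the dictionary and function names are illustrative only.

```python
# Word counts in millions, copied from the list of sources above.
SOURCES_MILLIONS = {
    "World Wide Web": 19.6,
    "Longman Higher Education textbooks": 12.1,
    "Longman Spoken American Corpus": 0.7,
    "British National Corpus": 4.4,
    "American National Corpus": 0.4,
}


def summarize(sources):
    """Return the total size and each source's percentage share of it."""
    total = sum(sources.values())
    shares = {name: 100 * size / total for name, size in sources.items()}
    return total, shares


total, shares = summarize(SOURCES_MILLIONS)
print(f"Total: {total:.1f} million words")
for name, share in shares.items():
    print(f"{name}: {share:.1f}%")
```

The five figures add up to 37.2 million words, which agrees with the "over 37 million" total. The spoken/written split cannot be derived from this list, since the list does not say how much of each source is spoken.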
The corpus was launched at IATEFL 2009. A full report is available at http://pearsonpte.com/wp-content/uploads/2014/07/RS_PICAE_2010.pdf
The PICAE corpus is POS-tagged with TreeTagger using the Penn Treebank tagset.
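To illustrate what working with such tagging might look like, here is a hedged Python sketch that reads tab-separated token, tag and lemma lines, the layout TreeTagger prints when run with its -token and -lemma options. The export format of PICAE itself is not stated here, so the layout is an assumption, and the sample lines are invented for illustration rather than taken from the corpus.

```python
from collections import Counter
from typing import NamedTuple


class Token(NamedTuple):
    word: str
    tag: str
    lemma: str


def parse_tagged(text):
    """Parse lines of the assumed form 'word<TAB>tag<TAB>lemma'."""
    tokens = []
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # skip blank lines and anything that is not a token line
        tokens.append(Token(*parts))
    return tokens


# Invented sample, NOT real PICAE data; the tags are Penn Treebank tags.
SAMPLE = (
    "Academic\tJJ\tacademic\n"
    "texts\tNNS\ttext\n"
    "in\tIN\tin\n"
    "the\tDT\tthe\n"
    "corpus\tNN\tcorpus\n"
)

tokens = parse_tagged(SAMPLE)
tag_counts = Counter(t.tag for t in tokens)

# Penn Treebank noun tags all start with "NN" (NN, NNS, NNP, NNPS).
nouns = [t.word for t in tokens if t.tag.startswith("NN")]

print(tag_counts)
print(nouns)
```

Filtering on a tag prefix is a convenient way to group related Penn Treebank tags, such as all noun tags, without listing each one.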
To obtain authorisation from Pearson to access the corpus:
- Contact Veronica Benigno at firstname.lastname@example.org. Provide a brief description of your research and state your academic affiliation.
- Then get in touch with Sketch Engine at email@example.com; they will update your account permissions accordingly.