PICAE: Pearson International Corpus of Academic English
The Pearson International Corpus of Academic English (PICAE) is a language corpus made up of texts collected from the Internet. PICAE comprises over 37 million words, 13% spoken and 87% written material, covering American, Australian, British, Canadian and New Zealand English. The texts span a wide range of academic subjects across the four main academic disciplines: humanities, social science, natural & formal science, and professions & applied sciences. The corpus also includes lectures, seminars, textbooks and journal articles at undergraduate and postgraduate levels, as well as university administrative material, university magazines, and TV and radio broadcasts.
The data in PICAE was gathered from five different sources:
- 19.6 million words from the World Wide Web
- 12.1 million words from the Longman Higher Education textbooks
- 0.7 million words from the Longman Spoken American Corpus
- 4.4 million words from the British National Corpus
- 0.4 million words of academic English from the American National Corpus
Material was also taken from the academic sections of the British National Corpus, which comprise 56 articles from 13 different academic disciplines (e.g. literature, art and chemistry) published between 1975 and 1993.
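As a quick check on the figures above, the following minimal sketch sums the five source sizes and works out each source's share of the corpus. The word counts are copied from the list of sources; the names `SOURCES_MILLIONS` and `source_shares` are illustrative and not part of any Sketch Engine or Pearson tool.

```python
# Word counts in millions, copied from the list of sources above.
SOURCES_MILLIONS = {
    "World Wide Web": 19.6,
    "Longman Higher Education textbooks": 12.1,
    "Longman Spoken American Corpus": 0.7,
    "British National Corpus": 4.4,
    "American National Corpus": 0.4,
}


def source_shares(sizes):
    """Return (total, {source: percentage of total}), rounded to one decimal."""
    total = sum(sizes.values())
    shares = {name: round(100 * size / total, 1) for name, size in sizes.items()}
    return round(total, 1), shares


total, shares = source_shares(SOURCES_MILLIONS)
print(f"total: {total} million words")
for name, pct in shares.items():
    print(f"{name}: {pct}%")
```

The total comes to 37.2 million words, which agrees with the "over 37 million words" stated in the description, and web material accounts for a little over half of the corpus.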
The corpus was launched at IATEFL 2009. A full report is available at http://pearsonpte.com/wp-content/uploads/2014/07/RS_PICAE_2010.pdf
The PICAE corpus is POS tagged by TreeTagger using the Penn Treebank tagset.
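To illustrate what POS-tagged text of this kind looks like, here is a minimal sketch that reads tagger output in the three-column layout TreeTagger normally produces (token, tag, lemma, separated by tabs) and counts the tags. The sample lines are made up for illustration, and the exact format in which PICAE is stored or distributed is an assumption here; `parse_vertical` is a hypothetical helper name.

```python
from collections import Counter

# Made-up sample in the usual TreeTagger layout: token <TAB> tag <TAB> lemma.
SAMPLE = (
    "The\tDT\tthe\n"
    "results\tNNS\tresult\n"
    "are\tVBP\tbe\n"
    "significant\tJJ\tsignificant\n"
)


def parse_vertical(text):
    """Parse tab-separated token/tag/lemma lines into a list of 3-tuples."""
    rows = []
    for line in text.splitlines():
        # Skip blank lines and structural markup such as <s> or <doc ...>.
        if not line.strip() or line.startswith("<"):
            continue
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # ignore malformed lines rather than guessing
        rows.append(tuple(parts))
    return rows


rows = parse_vertical(SAMPLE)
tag_counts = Counter(tag for _, tag, _ in rows)
lemmas = [lemma for _, _, lemma in rows]
print(tag_counts)
print(lemmas)
```

Here `NNS` (plural noun), `VBP` (non-third-person present verb), `JJ` (adjective) and `DT` (determiner) are standard Penn Treebank tags.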
To obtain authorisation from Pearson to access the corpus:
Tools to work with the PICAE corpus
A complete set of Sketch Engine tools is available to work with this academic English corpus to generate:
- word sketch – English collocations categorized by grammatical relations
- thesaurus – synonyms and similar words for every word
- keywords – terminology extraction of one-word and multi-word units
- word lists – lists of English nouns, verbs, adjectives etc. organized by frequency
- n-grams – frequency list of multi-word units
- concordance – examples in context
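Two of the outputs listed above, n-gram frequency lists and concordance lines, can be sketched in a few lines of plain Python. This is only a toy illustration of the concepts on a made-up sentence; it is not how Sketch Engine computes them, and the function names `tokenize`, `ngram_counts` and `concordance` are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple alphabetic tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def ngram_counts(tokens, n):
    """Count every contiguous sequence of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def concordance(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for each hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Made-up example sentence, not taken from PICAE.
TEXT = "The results of the study show that the results of the survey differ."
tokens = tokenize(TEXT)
bigrams = ngram_counts(tokens, 2)

for gram, freq in bigrams.most_common(3):
    print(" ".join(gram), freq)
for left, kw, right in concordance(tokens, "study", width=2):
    print(f"{left} [{kw}] {right}")
```

A real corpus tool works on indexed, tagged data and can filter by lemma or part of speech, but the underlying idea of counting token sequences and showing a keyword in context is the same.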
Ackermann, K., De Jong, J. H. A. L., Kilgarriff, A., & Tugwell, D. (2011). The Pearson International Corpus of Academic English (PICAE). In Proceedings of Corpus Linguistics.