PICAE: Pearson International Corpus of Academic English

The Pearson International Corpus of Academic English (PICAE) is a language corpus made up of texts collected from the Internet. PICAE comprises over 37 million words, of which 13% is spoken and 87% is written material, covering American, Australian, British, Canadian and New Zealand English. Corpus texts cover a wide range of academic subjects across the four main academic disciplines, namely humanities, social science, natural & formal science, and professions & applied sciences. Text types include lectures, seminars, textbooks and journal articles at undergraduate and postgraduate levels, university administrative material, university magazines, and TV and radio broadcasts.

The data for the PICAE corpus were gathered from five different sources:

  • 19.6 million words from the World Wide Web
  • 12.1 million words from the Longman Higher Education textbooks
  • 0.7 million words from the Longman Spoken American Corpus
  • 4.4 million words from the British National Corpus
  • 0.4 million words of academic English from the American National Corpus
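As a quick consistency check, the five figures above can be summed and converted to shares of the whole corpus. The following is a minimal sketch; the names `SOURCES`, `total_millions` and `shares` are illustrative and not part of any Sketch Engine or Pearson tool.

```python
# Word counts in millions, copied from the list of sources above.
SOURCES = {
    "World Wide Web": 19.6,
    "Longman Higher Education textbooks": 12.1,
    "Longman Spoken American Corpus": 0.7,
    "British National Corpus": 4.4,
    "American National Corpus": 0.4,
}


def total_millions(sources):
    """Return the combined size of all sources, in millions of words."""
    return round(sum(sources.values()), 1)


def shares(sources):
    """Return each source's share of the total, as a percentage."""
    total = sum(sources.values())
    return {name: round(100 * size / total, 1) for name, size in sources.items()}


if __name__ == "__main__":
    print(f"total: {total_millions(SOURCES)} million words")
    for name, pct in shares(SOURCES).items():
        print(f"{name}: {pct}%")
```

The total comes to 37.2 million words, which agrees with the "over 37 million words" stated in the overview.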

Material was also taken from the academic sections of the British National Corpus, which comprise 56 articles from 13 different academic disciplines (e.g. literature, art and chemistry) published between 1975 and 1993.

The corpus was launched at IATEFL 2009. A full report is available at http://pearsonpte.com/wp-content/uploads/2014/07/RS_PICAE_2010.pdf

Part-of-speech tagset

The PICAE corpus is POS tagged by TreeTagger using the Penn Treebank tagset.
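To illustrate what POS-tagged text looks like, the sketch below parses a few lines in a token, tag, lemma layout separated by tabs and counts the tags. The sample sentence is invented for illustration and is not taken from PICAE, the exact export format of the corpus is an assumption, and the helper names `parse_tagged` and `tag_counts` are hypothetical.

```python
from collections import Counter

# Hypothetical sample, one token per line: token<TAB>tag<TAB>lemma.
# Tags used here (DT, JJ, NN, IN, NNS) are standard Penn Treebank tags.
SAMPLE = (
    "The\tDT\tthe\n"
    "academic\tJJ\tacademic\n"
    "corpus\tNN\tcorpus\n"
    "of\tIN\tof\n"
    "lectures\tNNS\tlecture\n"
)


def parse_tagged(text):
    """Split tagged text into (token, tag, lemma) tuples, skipping malformed lines."""
    rows = []
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) == 3:
            rows.append(tuple(parts))
    return rows


def tag_counts(rows):
    """Count how often each POS tag occurs."""
    return Counter(tag for _, tag, _ in rows)


if __name__ == "__main__":
    for tag, count in tag_counts(parse_tagged(SAMPLE)).most_common():
        print(tag, count)
```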

Access policy

To obtain authorization from Pearson to access the corpus:

  1. Contact Veronica Benigno (veronica.benigno@pearson.com), providing a brief description of your research and stating your academic affiliation.
  2. Then get in touch with Sketch Engine at support@sketchengine.eu, who will update your account permissions accordingly.

Tools to work with the Pearson International Corpus of Academic English

A complete set of Sketch Engine tools is available to work with the PICAE corpus and generate:

  • word sketch – English collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • keywords – terminology extraction of one-word and multi-word units
  • word lists – lists of English nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
  • text type analysis – statistics of metadata in the corpus
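Two of the outputs listed above, n-gram frequency lists and concordances, are simple enough to sketch in a few lines. The code below is a toy illustration on an invented sentence, using whitespace tokenization; it is not how Sketch Engine implements these tools, and the function names `ngrams` and `kwic` are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for each match of keyword."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Invented example text, not taken from PICAE.
TEXT = "the results of the study support the results of earlier work"
TOKENS = TEXT.split()

if __name__ == "__main__":
    # Frequency list of two-word units.
    for gram, count in Counter(ngrams(TOKENS, 2)).most_common(3):
        print(" ".join(gram), count)
    # Concordance lines for one keyword, shown in context.
    for left, key, right in kwic(TOKENS, "results"):
        print(f"{left:>20} | {key} | {right}")
```

A real corpus query would also draw on lemmas, POS tags and text metadata, which this sketch leaves out.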

Ackermann, K., De Jong, J. H. A. L., Kilgarriff, A., & Tugwell, D. (2011). The Pearson International Corpus of Academic English (PICAE). In Proceedings of Corpus Linguistics.
