PennHistEn: Penn Parsed Corpora of Historical English

The Penn Parsed Corpora of Historical English (PennHistEn) is a corpus of historical English texts. This page describes the version of the PennHistEn corpus prepared for Sketch Engine. The original collection is distributed by the University of Pennsylvania (http://www.ling.upenn.edu/hist-corpora/). It covers texts ranging from Middle English to Modern British English (from the mid-12th to the early 20th century).

The Penn Corpora of Historical English, including the Penn-Helsinki Parsed Corpus of Middle English, second edition (PPCME2), the Penn-Helsinki Parsed Corpus of Early Modern English (PPCEME), and the Penn Parsed Corpus of Modern British English (PPCMBE), are running texts and text samples of British English prose across its history – from the earliest Middle English documents up to the First World War. The texts come in three forms: simple text, part-of-speech tagged text and syntactically annotated text. The syntactic annotation (parsing) permits searching not only for words and word sequences but also for syntactic structure. All of the annotations have been carefully checked by expert human annotators for accuracy and consistency. The corpora are designed for the use of students and scholars of the history of English, especially the historical syntax of the language, and they are publicly available to individuals, research groups and libraries.

The main author of the Penn Corpora of Historical English was Prof. Anthony Kroch (✝ 2021). The corpus is now maintained by Dr. Beatrice Santorini.

Tagsets

Access policy

Permission to access this corpus must be obtained from the copyright holder, the Linguistic Data Consortium (LDC), at ldc@ldc.upenn.edu. Once you have their approval, please forward it to support@sketchengine.eu so that the Sketch Engine team can grant you access to this corpus.

Tools to work with the Penn Historical corpora

A complete set of tools is available to work with this PennHistEn corpus to generate:

  • word sketch – English collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • keywords – terminology extraction of one-word units
  • word lists – lists of English nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
  • text type analysis – statistics of metadata in the corpus

Structures

documents

The original corpora contained two sets of document meta-information. Both are present, with ‘PPC_’ and ‘Helsinki_’ prefixes, as attributes of the document structure. These values were edited only minimally (abbreviations expanded, letter case normalized, whitespace removed, etc.).

Three further attributes were derived from the corresponding values in both annotations:

  • Author – equal to ‘PPC_Author’ if available, otherwise ‘Helsinki_Author’ with only the first letters in uppercase
  • Title – equal to ‘PPC_Text_name’ if available, otherwise ‘Helsinki_Text_name’ with only the first letters in uppercase
  • Date – 50-year-wide intervals covering the period between the earliest and the latest year mentioned in the date-related attributes
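
The derivation rules above can be illustrated with a minimal Python sketch. It is not the script used to build the corpus: the function names, the treatment of empty strings and “===NONE===” as "not available", the alignment of intervals to multiples of 50 and the "start-end" label format are all assumptions made for illustration.

```python
NONE_MARKER = "===NONE==="  # placeholder used in the corpus for empty metadata values


def pick_value(ppc_value, helsinki_value):
    """Prefer the PPC value; otherwise fall back to the Helsinki value
    with only the first letter of each word in uppercase."""
    if ppc_value and ppc_value != NONE_MARKER:
        return ppc_value
    if helsinki_value and helsinki_value != NONE_MARKER:
        return " ".join(word.capitalize() for word in helsinki_value.split())
    return NONE_MARKER


def date_intervals(earliest, latest, width=50):
    """List the 50-year-wide intervals that cover the span earliest..latest.

    Assumption: intervals start at multiples of `width` (1400, 1450, ...).
    """
    start = earliest - earliest % width
    intervals = []
    while start <= latest:
        intervals.append(f"{start}-{start + width - 1}")
        start += width
    return intervals


print(pick_value("", "GEOFFREY CHAUCER"))
print(date_intervals(1420, 1500))
```

Under these assumptions, a text dated between 1420 and 1500 would be assigned to three intervals, so a search restricted to any of them would find it.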

comments

Various edits, comments, etc. are tagged using a unary tag. The content of the comment can be accessed through its ‘value’ attribute, and its ‘type’ attribute distinguishes the different origins of the comments or the ways they were tagged in the original corpus.

Attributes

  • ascii – The original corpus used certain conventions to encode non-ASCII characters, ligatures, superscripts, etc. The ascii attribute contains the original ASCII-encoded form of the tokens, following these conventions:
    • superscripts: =X=
    • accents: X’
    • non-ASCII symbols:

      ascii form | upper case | name          | ascii form | lower case
      +A         | Æ          | ash           | +a         | æ
      +D         | Ð          | eth           | +d         | ð
      +G         | Ȝ          | yogh          | +g         | ȝ
      +TT, +Tt   |            | crossed thorn | +tt        |
      +T         | Þ          | thorn         | +t         | þ
                 |            | e caudata     | +e         | ę
                 |            |               | +o         | œ
      +L         | £          | pound sign    |            |
  • unicode – displays the text “as close as possible” to its original form; most signs, ligatures and superscripts were converted to their Unicode counterparts. The main limitation of the conversion is that the original format does not distinguish between different accents, so all accents were replaced with “combining vertical line above” (a neutral accent of sorts, which turns the apostrophe in they’re into they̍re), and all abbreviations, flourishes, tildes, etc. were replaced with “combining tilde”, as in Cobh̃m.
  • word – the default form of tokens used in Sketch Engine; for practical reasons, it does not distinguish between different superscripted forms, and all ligatures and historical letters were replaced with their closest Latin letter counterparts (following the ASCII encoding, but omitting the ‘+’ sign).
  • lc – lowercase normalized word form
  • tag – POS tag provided along with the original corpus
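
As a rough illustration of how the ascii, unicode and word attributes relate, the following Python sketch decodes the ‘+’ symbol codes listed above. It is only a sketch under stated assumptions, not the conversion actually used for the corpus: the function names are hypothetical, superscripts and accents are not handled, and the crossed-thorn codes are left untouched because their Unicode symbol is not given in the list.

```python
import re

# Symbol codes from the list above. Codes without a symbol there
# (+TT, +Tt, +tt for the crossed thorn) are deliberately omitted.
ASCII_TO_UNICODE = {
    "+A": "Æ", "+a": "æ",
    "+D": "Ð", "+d": "ð",
    "+G": "Ȝ", "+g": "ȝ",
    "+T": "Þ", "+t": "þ",
    "+e": "ę", "+o": "œ",
    "+L": "£",
}

# Try the two-letter crossed-thorn codes first so that "+tt" is not
# misread as thorn followed by "t" (an assumption about the encoding).
CODE_RE = re.compile(r"\+(?:TT|Tt|tt|[A-Za-z])")


def to_unicode(ascii_form):
    """Replace known '+X' codes with Unicode letters; keep unknown codes as-is."""
    return CODE_RE.sub(
        lambda match: ASCII_TO_UNICODE.get(match.group(0), match.group(0)),
        ascii_form,
    )


def to_word(ascii_form):
    """Approximate the 'word' form: the ASCII encoding without the '+' sign."""
    return ascii_form.replace("+", "")


print(to_unicode("+t+at"), to_word("+t+at"))
```

For the token encoded as +t+at, the sketch yields þæt as the unicode-style form and tat as the word-style form, which matches the description of the word attribute as the ASCII encoding with the ‘+’ sign omitted.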

General information about the corpus

Frequency
Tokens 4,404,931
Words 3,800,639
Documents 605

Conversion process

The original corpus consists of three parts (the Penn-Helsinki Parsed Corpus of Middle English, second edition – PPCME2, the Penn-Helsinki Parsed Corpus of Early Modern English – PPCEME, and the Penn Parsed Corpus of Modern British English – PPCMBE) that differ slightly in the way they were annotated.

The Sketch Engine version of PennHistEn contains almost all of the metadata and tagging of the original corpus (which itself retains most, but not all, of the markup of the source corpora; for example, line breaks and paragraphs were not preserved). It was further normalized so that the whole collection can be treated as one corpus.

Document metadata

Empty values (all “X”es and “n/a”) were replaced with the value “===NONE===”.
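
A minimal Python sketch of this normalization step is shown below. The function name and the exact matching rules (a value consisting only of uppercase “X” characters, or exactly “n/a”, ignoring surrounding whitespace) are assumptions; the actual conversion may have used different criteria.

```python
import re

NONE_MARKER = "===NONE==="
ONLY_X = re.compile(r"X+")  # assumption: "all X-es" means one or more uppercase X


def normalize_empty(value):
    """Map placeholder values ('X', 'XX', ..., 'n/a') to the ===NONE=== marker."""
    stripped = value.strip()
    if stripped == "n/a" or ONLY_X.fullmatch(stripped):
        return NONE_MARKER
    return value


for raw in ["X", "XXX", "n/a", "Xenophon", "Chaucer"]:
    print(repr(raw), "->", repr(normalize_empty(raw)))
```

Using a full match rather than a prefix match keeps real values that merely begin with “X” unchanged.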

Bibliography and how to cite this corpus

Original corpus


How to cite

The Penn Parsed Corpora of Historical English should be cited individually rather than as a single bibliographic entry. The citation should include the website of the corpus, its edition, and its date of release. Here are the proper citations as of June 1, 2011:

  • Anthony Kroch and Ann Taylor. 2000. The Penn-Helsinki Parsed Corpus of Middle English (PPCME2). Department of Linguistics, University of Pennsylvania. CD-ROM, second edition, (http://www.ling.upenn.edu/hist-corpora/).
  • Anthony Kroch, Beatrice Santorini, and Lauren Delfs. 2004. The Penn-Helsinki Parsed Corpus of Early Modern English (PPCEME). Department of Linguistics, University of Pennsylvania. CD-ROM, first edition, (http://www.ling.upenn.edu/hist-corpora/).
  • Anthony Kroch, Beatrice Santorini, and Ariel Diertani. 2010. The Penn-Helsinki Parsed Corpus of Modern British English (PPCMBE). Department of Linguistics, University of Pennsylvania. CD-ROM, first edition, (http://www.ling.upenn.edu/hist-corpora/).

