Historical Corpus of German Newspapers 1650–1800

The GerManC corpus is a representative historical corpus of German newspapers from the period 1650–1800, distributed by the University of Oxford Text Archive.

The corpus consists of short text samples of some 200 words each from German newspapers of the early modern period 1650–1800. The corpus metainformation contains full bibliographic details of the original texts, e.g. region, genre, year of publication, author and title. Texts are divided into three fifty-year subperiods (1650–1700, 1701–1750 and 1751–1800).
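
The subperiod division can be illustrated with a minimal Python sketch. The function name `subperiod` and the label strings are hypothetical and are not taken from the corpus metadata; only the three year ranges come from the description above.

```python
def subperiod(year: int) -> str:
    """Map a year of publication to one of the three fifty-year subperiods.

    The returned labels are illustrative assumptions; the actual metadata
    values used in the corpus may be spelled differently.
    """
    if 1650 <= year <= 1700:
        return "1650-1700"
    if 1701 <= year <= 1750:
        return "1701-1750"
    if 1751 <= year <= 1800:
        return "1751-1800"
    # Years outside 1650-1800 are not covered by the corpus.
    raise ValueError(f"year {year} is outside the corpus period 1650-1800")
```

For example, `subperiod(1723)` falls into the second subperiod, while a year such as 1801 raises a `ValueError`.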

Conversion process

The GerManC corpus in Sketch Engine is based on its LING-GATE version, which contains both linguistic and structural annotations. All annotations were preserved except those that annotate subparts of certain tokens. However, certain structural elements (such as headings, acts and speakers) were not annotated in GerManC, so the values of the corresponding attributes were left blank.

Finally, the whole corpus was retagged with the standard TreeTagger to provide word sketches, which make it possible to explore the grammatical behavior of German in the early modern period.

Part-of-speech tagset

The GerManC POS tagging scheme is based on the STTS tagset for German, with a number of modifications to account for differences between modern and Early Modern German. The POS annotations in GerManC were produced by a retrained version of TreeTagger. See the STTS tagset for German.


For all tokens:

  • word – original word form
  • tag – TreeTagger output (see the tagset summary)
  • lempos – lemma+part_of_speech (based on TreeTagger output)

Based on the original GerManC annotation (values may be blank where the annotation is unavailable):

  • lemma – base lemma (in its modern form)
  • norm – normalized word form
  • lc – lowercase normalized word form
  • morph – morphological information
  • tag2 – part-of-speech (original tagger output)
  • ptag – syntactic category (original tagger output)
  • kind – token type (word, number, punctuation, etc.)
  • pID – word id in sentence (used by parser)
  • pDepID – dependency relation (parser output)
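
As a rough illustration of how these attributes might be handled in a script, the following Python sketch reads one token line of a tab-separated, one-token-per-line export into a dictionary keyed by attribute name. The column order, the function name `parse_token_line` and the sample values are assumptions made for this example; they are not taken from the corpus itself.

```python
from typing import Dict, List

# Attribute names as listed above. The column order is an ASSUMPTION;
# check the actual corpus configuration before relying on it.
ATTRIBUTES: List[str] = [
    "word", "tag", "lempos",            # available for all tokens
    "lemma", "norm", "lc", "morph",     # from the original annotation
    "tag2", "ptag", "kind", "pID", "pDepID",
]


def parse_token_line(line: str) -> Dict[str, str]:
    """Split one tab-separated token line into an attribute dictionary.

    Missing trailing columns are filled with empty strings, mirroring the
    fact that some original annotations are unavailable. Extra columns
    beyond the known attributes are ignored.
    """
    fields = line.rstrip("\n").split("\t")
    # Pad short lines so that every attribute name gets a value.
    fields += [""] * (len(ATTRIBUTES) - len(fields))
    return dict(zip(ATTRIBUTES, fields))


# Hypothetical sample line with only the first three attributes filled in.
token = parse_token_line("Zeitung\tNN\tZeitung-n\n")
print(token["word"], token["tag"], repr(token["lemma"]))
```

Padding with empty strings keeps lookups such as `token["morph"]` safe even for tokens whose original annotation was left blank.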


The corpus was prepared by Martin Durrell, Paul Bennett, Silke Scheible and Richard J. Whitt.

Durrell, Martin; Ensslin, Astrid and Bennett, Paul (eds.). GerManC. A Historical Corpus of German Newspapers 1650-1800 [Electronic resource].

