Historical Corpus of German Newspapers 1650–1800
The GerManC corpus is a representative historical corpus of German newspapers from the period 1650–1800, distributed by the University of Oxford Text Archive.
The corpus consists of short text samples of some 200 words each from German newspapers of the early modern period 1650–1800. The corpus metainformation contains full bibliographic details of the original texts, e.g. region, genre, year of publication, author and title. Texts are divided into three fifty-year subperiods (1650–1700, 1701–1750 and 1751–1800).
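The fifty-year subperiods can be expressed as a simple lookup on the year of publication. The following is a minimal sketch in Python; the names `SUBPERIODS` and `subperiod` are hypothetical and are not part of the corpus or of Sketch Engine, and the label format is an assumption.

```python
# Hypothetical helper: map a publication year to its GerManC subperiod.
# The boundaries follow the three subperiods named in the corpus description;
# the label format "start-end" is an assumption made for illustration.
SUBPERIODS = [(1650, 1700), (1701, 1750), (1751, 1800)]


def subperiod(year: int) -> str:
    """Return the subperiod label for a year, or raise if out of range."""
    for start, end in SUBPERIODS:
        if start <= year <= end:
            return f"{start}-{end}"
    raise ValueError(f"year {year} is outside the corpus range 1650-1800")


if __name__ == "__main__":
    print(subperiod(1723))
```

Years outside 1650–1800 raise an error rather than being silently assigned to the nearest subperiod, so that texts with a wrong or missing publication year are noticed.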
The GerManC corpus in Sketch Engine is based on its LING-GATE version, which contains both linguistic and structural annotations. All annotations were preserved, except those annotating subparts of certain tokens. Certain elements (such as headings, acts and speakers), however, were not annotated within GerManC, so the values of the corresponding attributes were left blank.
Finally, the whole corpus was retagged with the standard TreeTagger to provide word sketches, which make it possible to explore the grammatical behavior of German in the early modern period.
The GerManC POS tagging scheme is based on the STTS tagset for German, with a number of modifications to account for differences between modern and Early Modern German. The POS annotations in GerManC were produced by a re-trained version of the TreeTagger tool. For the tag definitions, see the STTS tagset for German.