Corpora are a good starting point for collecting historical texts. You can create a corpus by uploading your own texts in various formats (TXT, PDF, DOC, etc.), or you can use our tool for building corpora from the web, e.g. by downloading specific websites that contain historical texts or books. A corpus can be divided into smaller parts called subcorpora, which lets you work with only selected parts of the whole corpus, e.g. texts from a specific time period, by a single author or in a single genre.
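The idea of a subcorpus can be illustrated with a minimal Python sketch. It is not the Sketch Engine interface or API. It only assumes that each text carries some metadata and shows how a subset could be selected by time period, author or genre. The `Document` class, the `subcorpus` function and the sample entries are all hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    """One text in the corpus plus the metadata used for selection (hypothetical)."""
    title: str
    author: str
    year: int
    genre: str


def subcorpus(corpus, author=None, genre=None, year_from=None, year_to=None):
    """Return the documents that satisfy every criterion that was given.

    Criteria left as None are ignored; the year range is inclusive.
    """
    selected = []
    for doc in corpus:
        if author is not None and doc.author != author:
            continue
        if genre is not None and doc.genre != genre:
            continue
        if year_from is not None and doc.year < year_from:
            continue
        if year_to is not None and doc.year > year_to:
            continue
        selected.append(doc)
    return selected


# Placeholder entries, invented for the example only.
corpus = [
    Document("Text 1", "Author A", 1580, "dialogue"),
    Document("Text 2", "Author B", 1655, "newspaper"),
    Document("Text 3", "Author A", 1690, "newspaper"),
    Document("Text 4", "Author C", 1740, "dialogue"),
]

seventeenth_century = subcorpus(corpus, year_from=1600, year_to=1699)
author_a_newspapers = subcorpus(corpus, author="Author A", genre="newspaper")
```

Here `seventeenth_century` holds the two texts dated within 1600–1699, and `author_a_newspapers` holds the single newspaper text by the placeholder "Author A". The rest of the corpus stays untouched, so several subcorpora can be defined over the same set of texts.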
Historical corpora:
- Corpus of English Dialogues 1560–1760 (English)
- Early English Books Online 1473–1820 (English)
- GerManC. A Historical Corpus of German Newspapers 1650–1800 (German)
- Penn Historical Corpora (English)
- Latin corpus (Latin)
Sketch Engine is also used in the ChartEx project, which applies text mining methods to medieval Latin charters. The project will make its corpora publicly available through Sketch Engine as it proceeds.
References
Adam Kilgarriff, Miloš Husák and Robyn Woodrow (2012). The Sketch Engine as infrastructure for historical corpora. In Jeremy Jancsary (ed.). Empirical Methods in Natural Language Processing; Proceedings of the Conference on Natural Language Processing 2012, pp. 351–356.
Barbara McGillivray and Adam Kilgarriff (2012). Tools for historical corpus research, and a corpus of Latin (presentation). In New Methods in Historical Corpus Linguistics 3, Germany, 2013, pp. 247–255.