Transhistorical Corpus of Written English

The Transhistorical Corpus of Written English (TCWE) is a diachronic text corpus developed at Edge Hill University as part of a project which has taken place between 2019–2021, directed by Dr. Imogen Marcus and with the assistance of Dr. Ursula Maden-Weinberger. The corpus contains five sub-corpora: sermons, statutes, letters, emails, and instant messages. Please see the infographic below:

The texts within the corpus range in date from the fifteenth to the twenty-first century. See Table 1 below, which details how many tokens are in each sub-corpora by century, and Table 2, which details how many words are in each sub-corpora by century. Sermons, statutes, and letters stretch back to the medieval period, whilst email and instant messaging are confined to the twenty-first century.

Go to corpus

Table 1: Number of tokens per century in individual text type sub-corpora

	Text types
	Sermons	Letters	Instant messaging	Email	Statutes
Century
15^th C	24579	24507	0	0	21532
16^th C	23752	2365	0	0	22447
17^th C	24157	26904	0	0	21621
18^th C	24079	28056	0	0	22719
19^th C	22404	23134	0	0	21359
20^th C	23055	22503	0	0	25308
21^st C	23021	0	83196	51072	25377
Total	165047	148756	83196	51072	160363

Table 2: Number of words per century in individual text type sub-corpora

	Text types
	Sermons	Letters	Instant messaging	Email	Statutes
Century
15^th C	20456	20819	0	0	20307
16^th C	19846	20139	0	0	20666
17^th C	19930	22718	0	0	20599
18^th C	20336	21657	0	0	20182
19^th C	19393	19880	0	0	18949
20^th C	19823	20122	0	0	21225
21^st C	19966	0	50990	44354	19273

Grammatical tagging

Since Sketch Engine can apply a tagger without any additional effort, all Modern English texts in the corpus, i.e. those dating from the 20^th and 21^st Centuries, were tagged and lemmatized using TreeTagger. This tagging is useful for anyone working with these Modern English documents. However, it should be noted that this tagging only applies to these modern texts. It cannot be reliably used in relation to texts in the corpus dating from before the 20^th Century, because there is a much higher degree of spelling variation in these texts. There are also words used which have fallen out of use in Modern English. These words are usually tagged as nouns and are not lemmatized.

Annotation consistency

We have made sure that the annotation across the Transhistorical Corpus of Written English is consistent. For more information, please see the table below.

Annotation table

Characteristic	Specific character	CEECS	Innsbruck	EEBO/ECCO	CLEP	SketchEngine
						FIND	REPLACE	This affects ONLY texts
Early letters		+	represented as characters	converted into modern equivalents	n/a
	Ash	+A	Æ			+A	AE	L15, L16
	ash	+a	æ			+a	ae	L15, L16
	Eth	+D	Ð			+D	Th	L15, L16
	eth	+d	ð			+d	th	L15, L16
	Yogh	+G	3			+G	Ȝ	L15, L16
	yogh	+g	3			+g	ȝ	L15, L16
	Thorn	+T	Þ			+T	Th	L15, L16
	thorn	+t	þ			+t	th	L15, L16
	Crossed Thorn	+TT	Ꝥ			+TT	Th	L15, L16
	Crossed Thorn	+Tt				+Tt	th	L15, L16
	crossed thorn	+tt	ꝥ			+tt	th	L15, L16
	+e	e caudata		ae		+e	ae	L15, L16
	+L	£ (pound sign)	£			+L	£	L15, L16
						3	ȝ	S15_001
						Þ	Th	S15_001
						þ	th	S15_001

Abbreviations
	tilde or dash above letter, flourish, apostrophe within word	letter followed by ~ (e.g. p~vided)	followed editors, either extending or printing as is	dash above letter (e.g. declaracōn for declaracion)		ō	o~	S15, S16, S17, S18
						ū	u~	S15, S16, S17, S18
						n̄	n~	S15, S16, S17, S18
						ē	e~	S15, S16, S17, S18
						ȳ	y~	S15, S16, S17, S18
						ā	a~	S15, S16, S17, S18
						m̄	m~	S15, S16, S17, S18
						p̄	p~	S15, S16, S17, S18
						q̄	q~	S15, S16, S17, S18
						ī	i~	S15, S16, S17, S18
						ꝓ	p~	S15, S16, S17, S18


				fecimꝰ = abbreviation “us” –> fecimus		ꝰ	us	S15, S16
	&	& = and	& = and	& = and		&	and
	y.
Superscripts
	superscript e.g. t, r etc	between == (e.g. w=t=)	between ==	^ (e.g. w^t)	between ==	^[any letters]	=any letters=	S15, S16, S17, S18
Accents
	é, è etc.	any accent replaced by accent grave ` after letter		on letter (e.g. ô dearely beloved)		é	e`
						ô	o`
Text Level Codes
	editors’ comments	[\…\]	\|[…] or […] within words	[…], also page numbers		[\…\]	delete	L15-L19
						\|[…]	delete	L15-L19
	font other than basic font	(^…^)				(^	delete	L15-L19
						^)	delete	L15-L19
	foreign language	(\…\)			(\…\)	(\	delete	L15-L19
						\)	delete	L15-L19
	emendations	[{…..{]				[{	delete	L15-L19
						{]	delete	L15-L19
						[}	delete	L15-L19
	heading	[}…}]	\|… (also page/line numbers, remarks)			}]	delete	L15-L19
	Corpus Coder comment	[^…^]			[^…^]	[^…^]	leave as is	L15-L19
	metadata	<…>	\|<...>		<…> (metadata)
	special initials		\|
	folio references		\|r[f.8v]
	paragraph marker			¶
	deviant word joining				% (e.g. Iam = I %am)	%	delete	L18, L19
	uncertain letters				{…}	{	delete	L18, L19
	unreadable letters				{**…}	}	delete	L18, L19
	omissions	^…^			^…^, e.g. I had the pleasure of seeing ^you^ but it	^…^	leave as is	L18, L19

Corpus rationale

The corpus has been designed to investigate innovation in digital written language, in particular the way it has been previously been conceptualised as a hybrid of speech and writing, in a historical context. It is for this reason that the corpus contains sermons (towards the speech end of a conceptual speech-writing continuum), statutes (towards the writing end of a conceptual speech-writing continuum), as well as letters, email and instant messages. However, the corpus does not need to be used for just this purpose.

The corpus contains a large amount of metadata and each user can therefore use many search criteria, including text type, century, in the case of letters, author name, author gender, recipient name and recipient gender. Below is a table which outlines what each metadata label means, and which text type sub-corpus it applies to.

Go to corpus

Metadata in the corpus and what it means

Metadata

Metadata label	What it means	Which text type sub-corpus it applies to
Text ID	The ID number of each individual text file in the corpus. See text ID key below this table.	Every sub-corpus
Text type	You can search for each of the five text types in the corpus: sermons, statutes, letters, email and instant messaging.	Every sub-corpus
Century	These are the different centuries the texts date from. They include: 15^th Century, 16^thCentury, 17^th Century, 18^th Century, 19^th Century, 20^thCentury, 21^st Century.	Every sub-corpus
Year	Specific year in which the text was composed.	Every sub-corpus
Date	Specific day and month on which the text was composed if known.	Email and 15th, 16^th, 18^th, 19^th Century letters.
Source	Source from which the data is taken. This may be a previously established corpus, website, archive or transcriptions made by the project members. See the ‘copyright’ section below for more information.	Every sub-corpus
Collection	Particular collections, e.g. letter collections, that the data has been sourced from.	15^th Century Statutes, 20^th Century Sermons, 15-17^th Century letters
Original ID	If the text files in a particular sub-corpus are taken from a previously created corpus, e.g. the Corpus of Early English Correspondence (CEEC), these are the text IDs that were originally assigned to them in that source corpora.	15-17^th Century letters, 20^th Century letters
Author name	The author’s name	Instant messaging (although nb anonymised in this sub-corpus), email, all letters, all sermons
Author gender	The author’s gender	Instant messaging, email, all letters, all sermons
Recipient name	The recipient’s name	Instant messaging, email, all letters
Recipient gender	The recipient’s gender	Instant messaging, email, all letters
Location	Where the text was composed/written	18-19^th Century letters
Title	The title of the text, if applicable	All statutes, all sermons

Text ID label key

Text ID Label	What it refers to
S15, e.g. S15_001	15^th Century sermon, first text file (text file number included for information)
S16	16^th Century sermon
S17	17^th Century sermon
S18	18^th Century sermon
S19	19^th Century sermon
S20	20^th Century sermon
S21	21^st Century sermon
T15	15^th Century statute
T16	16^th Century statute
T17	17^th Century statute
T18	18^th Century statute
T19	19^th Century statute
T20	20^th Century statute
T21	21^st Century statute
L15	15^th Century letter
L16	16^th Century letter
L17	17^th Century letter
L18	18^th Century letter
L19	19^th Century letter
L20	20^th Century letter
E21	21^st Century email
I21	21^st Century instant message

Tools to work with the Transhistorical Corpus of Written English

A complete set of tools is available to work with this English transhistorical corpus to generate:

word sketch – English collocations categorized by grammatical relations
thesaurus – synonyms and similar words for every word
keywords – terminology extraction of one-word and multi-word units
word lists – lists of English nouns, verbs, adjectives, etc. organized by frequency
n-grams – frequency list of multi-word units
concordance – examples in context
trends – diachronic analysis automatically identifies neologisms and changes in use
text type analysis – statistics of metadata in the corpus

Bibliography - Copyright and permissions

Copyright and permissions

The texts within the corpus have come from a range of sources. Many of them, such as the emails, taken from the freely available online ENRON corpus, are within the public domain. It was not therefore necessary to seek permission to reproduce these texts. Other parts of the TCWE, predominantly the correspondence sub-corpus, but also parts of the sermon sub-corpus, are texts from other corpora and websites which have been reproduced here, with the permission of the creators. The Whatsapp instant messaging data was obtained from private individuals who chose to donate their Whatsapp messages to the corpus. No permission is needed to access them. Further details can be found below.

Permissions pertaining to and information about the Whatsapp instant
messages in the corpus:

The data in the Whatsapp instant messaging sub-corpus was collected by the project

leader Dr Imogen Marcus. Consent was gained in advance from everyone who donated their messaging data for this sub-corpus. Information regarding the Whatsapp sub-corpus, for the attention of corpus users: text files I21_001-I21_005 are all chats between two people in which both people’s messages are included. Text files I21_006 – I21_008 are all chats between two people in which one of the two writers chose for their messages to be excluded from the corpus. These are therefore two person chats in which only one person’s messages are included. Text files I21_009 and I21_010 are group chats in which everyone’s messages are included. The Whatsapp messages are licenced under a Creative Commons Attribution 4.0 Licence.

Copyright and permissions pertaining to the correspondence sub-corpus

Copyright agreement pertaining to the 15-17^th Century correspondence included in the corpus:

The TCWE incorporates a sample of 15^th-17^th Century letters from Corpus of Early English Correspondence Sampler (CEECS) within its correspondence sub-corpus. This document recognizes and acknowledges that, as copyright holders of CEECS, the CEEC team (led by Professor Terttu Nevalainen) have agreed to allow the inclusion of these letters in the TCWE corpus, and for the same to be made available on the Sketch Engine platform. The full references and credit lines for these letters are listed below:

CEEC = Corpus of Early English Correspondence. Compiled by the CEEC team under Terttu Nevalainen at the Department of Modern Languages, University of Helsinki. https://varieng.helsinki.fi/CoRD/corpora/CEEC/

CEECS = Corpus of Early English Correspondence Sampler. 1998. Compiled by Terttu Nevalainen, Helena Raumolin-Brunberg, Jukka Keränen, Minna Nevala, Arja Nurmi and Minna Palander-Collin at the Department of Modern Languages, University of Helsinki.

Nurmi, Arja (ed.). 1998. Manual for the Corpus of Early English Correspondence Sampler CEECS. Department of Modern Languages. University of Helsinki. clu.uni.no/icame/manuals/CEECS/INDEX.HTM.

PCEEC = Parsed Corpus of Early English Correspondence. 2006. Compiled by Terttu Nevalainen, Helena Raumolin-Brunberg, Jukka Keränen, Minna Nevala, Arja Nurmi and Minna Palander-Collin. Annotated by Arja Nurmi, Ann Taylor, Anthony Warner, Susan Pintzuk, and Terttu Nevalainen. Helsinki: University of Helsinki and York: University of York.

Copyright agreement pertaining to the 18^th Century letters in the corpus:

The sample of 18^th Century letters from the Corpus of Late Eighteenth Century Prose have been reproduced with the permission of Professor David Denison, University of Manchester and Dr. Linda van Bergen, University of Edinburgh. This document recognizes and acknowledges the John Rylands University Library of Manchester, where the originals of the texts are held, as well as the ‘The English language of the north-west in the late Modern English period’ project, directed by David Denison, with Linda van Bergen as principal collaborator.

Copyright agreement pertaining to the 19^th Century letters in the corpus:

The sample of 19^th Century letters from the Corpus of Late Modern English Prose have been reproduced with the permission of Professor David Denison. The Corpus of Late Modern English Prose was constructed between 1992 and 1994 by Prof. David Denison, Department of English Language & Literature, University of Manchester, with the very considerable assistance of Graeme Trousdale and Linda van Bergen.

Copyright agreement pertaining to the 20^th Century letters in the corpus:

The sample of 20th Century letters from the British Telecom Correspondence Corpus (BTCC) have been reproduced with the permission of its creator Dr. Ralph Morton (Birmingham City University).

Copyright and permissions pertaining to the sermon sub-corpus

The 21^st Century sermons have been taken from the Lancaster Priory website and reproduced with the permission of their authors: Joel Love, Kara Cooper, Chris Newlands, John Rodwell and Kevin Huggett.

Search the Transhistorical Corpus of Written English

Sketch Engine offers a range of tools to search and analyze this diachronic corpus.

Go to corpus (no account required)

about Sketch Engine

Other text corpora

Sketch Engine offers 800+ language corpora.

available corpora

Use Sketch Engine in minutes

Generate collocations, frequency lists, examples in contexts, n-grams or extract terms. Use our Quick Start Guide to learn it in minutes.

Quick Start Guide