Multicultural London English corpus

The Multicultural London English corpus is a spoken English corpus made up of transcripts collected in London. The corpus represents Multicultural London English, a sociolect composed of new varieties of English that emerged in London in the late 20th century. It contains transcripts of informal, conversation-like interviews with one or two speakers and a fieldworker, as well as some self-recordings. The transcripts come from two ESRC-funded projects: Linguistic Innovators and Multicultural London English.

The corpus consists of 2.4 million words, divided into separate subcorpora based on the nationality of the speakers.

For more details about the speakers and the research projects from which these transcripts derive, see the bibliography.

Part-of-speech tagset

The Multicultural London English corpus was processed using TreeTagger with the Penn TreeBank tagset.

Tools to work with the Multicultural London English corpus

A complete set of tools is available to work with this spoken English corpus to generate:

  • word sketch – English collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • word lists – lists of English nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
  • keywords – terminology extraction of one-word units
  • text type analysis – statistics of metadata in the corpus
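To make a few of the outputs listed above more concrete, the following is a minimal Python sketch of a word list, an n-gram frequency list and a keyword-in-context concordance. It is an illustration only and not how the corpus tools themselves are implemented: the function names (tokenize, ngrams, concordance) are hypothetical, the sample sentence is invented rather than taken from the corpus, and the whitespace tokenizer is a deliberate simplification.

```python
from collections import Counter

def tokenize(text):
    # Simplifying assumption: lowercase and split on whitespace only.
    return text.lower().split()

def ngrams(tokens, n):
    # Every run of n consecutive tokens, as tuples.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def concordance(tokens, keyword, width=2):
    # Keyword-in-context: (left context, keyword, right context) per hit.
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

# Invented sample text, not a transcript from the corpus.
sample = "we went to the market and then we went to the park"
tokens = tokenize(sample)

word_freq = Counter(tokens)               # word list by frequency
bigram_freq = Counter(ngrams(tokens, 2))  # n-gram frequency list (n = 2)
kwic = concordance(tokens, "went")        # concordance for one word

print(word_freq.most_common(3))
print(bigram_freq.most_common(3))
for left, kw, right in kwic:
    print(f"{left:>10} [{kw}] {right}")
```

On a real corpus, the same ideas would be applied to part-of-speech-tagged tokens, so that word lists can be restricted to nouns, verbs or adjectives.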

Bibliography

Cheshire, J., Kerswill, P., Fox, S. and Torgersen, E. (2011). Contact, the feature pool and the speech community: The emergence of Multicultural London English. Journal of Sociolinguistics 15, pp. 151–196.

