BASE: British Academic Spoken English

The British Academic Spoken English (BASE) corpus is a text corpus developed at the Universities of Warwick and Reading. The version available in Sketch Engine consists of 160 lectures (video-recorded at the University of Warwick and audio-recorded at the University of Reading) with a total size of 1.1 million words. The lectures have been transcribed and annotated in accordance with the TEI Guidelines. Each lecture transcription belongs to one of four main academic divisions:

  • Arts and Humanities
  • Life Sciences
  • Physical Sciences
  • Social Sciences

Further information can be found at

Part-of-speech tagset

The corpus was part-of-speech tagged with CLAWS, using version 7 of its tagset (CLAWS7).

A complete set of Sketch Engine tools is available to work with the British Academic Spoken English Corpus to generate:

  • word sketch – English collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • keywords – terminology extraction of one-word and multi-word units
  • word lists – lists of English nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
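To make the idea of an n-gram frequency list concrete, here is a minimal Python sketch. It is not how Sketch Engine computes its n-grams; it only illustrates the concept, and the sample sentence is made up for the example rather than taken from the corpus.

```python
from collections import Counter


def ngram_frequencies(tokens, n=2):
    """Count every run of n consecutive tokens in a token list."""
    # zip over n shifted copies of the list yields each n-token window.
    windows = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(window) for window in windows)


# Invented sample text (an assumption, not a BASE transcript).
tokens = "so what we are going to do is what we are going to look at".split()

# Print the three most frequent three-word units with their counts.
for gram, count in ngram_frequencies(tokens, 3).most_common(3):
    print(count, gram)
```

Real transcripts would first need tokenization that handles punctuation and transcription mark-up, which this sketch deliberately leaves out.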

The British Academic Spoken English (BASE) corpus is freely available to researchers who agree to the following conditions:

  1. Corpus holdings should not be reproduced in full for a wider audience/readership (i.e. for publication or for teaching purposes), although researchers are free to quote short passages of text up to 100 running words, with a total of 200 running words from any given assignment.
  2. No part of the corpus holdings should be reproduced in teaching materials intended for publication (in print or via the internet).
  3. The corpus developers should be informed of all presentations and publications arising from analysis of the corpus.

Researchers must acknowledge their use of the BASE corpus using the following form of words: The recordings and transcriptions used in this study come from the British Academic Spoken English (BASE) corpus. The corpus was developed at the Universities of Warwick and Reading under the directorship of Hilary Nesi and Paul Thompson. Corpus development was assisted by funding from BALEAP, EURALEX, the British Academy and the Arts and Humanities Research Council.

Files should be referred to by their letter and number codes, indicating disciplinary grouping (e.g. ah = arts and humanities), type of speech event (e.g. lct = lecture) and file number.
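As a rough illustration of this naming scheme, the following Python sketch splits a file code into its three parts. It assumes the code is written as two letters for the disciplinary grouping, three letters for the speech event type and then the file number, with nothing in between; that layout and the identifier "ahlct001" are assumptions made for the example. Only the two codes named above (ah and lct) are expanded, and any other code is returned unchanged rather than guessed.

```python
import re
from typing import NamedTuple

# Only the codes mentioned in the text are mapped; others stay as-is.
GROUPS = {"ah": "arts and humanities"}
EVENTS = {"lct": "lecture"}

# Assumed layout: 2-letter group, 3-letter event type, then digits.
_FILE_ID = re.compile(r"^([a-z]{2})([a-z]{3})(\d+)$")


class BaseFileId(NamedTuple):
    group: str
    event: str
    number: int


def parse_file_id(file_id: str) -> BaseFileId:
    """Split a file code into grouping, speech event type and file number."""
    match = _FILE_ID.match(file_id.strip().lower())
    if match is None:
        raise ValueError(f"unrecognised file identifier: {file_id!r}")
    group, event, number = match.groups()
    return BaseFileId(GROUPS.get(group, group), EVENTS.get(event, event), int(number))


# "ahlct001" is a hypothetical identifier used only for illustration.
print(parse_file_id("ahlct001"))
```

Leaving unknown codes unexpanded keeps the sketch honest about what the text actually documents; a fuller mapping would need to be taken from the corpus documentation.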

Nesi, H. and H. Basturkmen (2006) ‘Lexical bundles and discourse signalling in academic lectures’. International Journal of Corpus Linguistics 11(3) 147-168

Thompson, P. (2006) ‘A corpus perspective on the lexis of lectures, with a focus on Economics lectures’. In K. Hyland and M. Bondi (eds) Academic Discourse Across Disciplines Bern: Peter Lang, pp. 253-270

Nesi, H. (2002) ‘An English spoken academic word list’. In A. Braasch and C. Povlsen (eds) Proceedings of the Tenth EURALEX International Congress, Copenhagen: Center for Sprogteknologi

Nesi, H. (2001) ‘A corpus based analysis of academic lectures across disciplines’. In J. Cotterill and A. Ife (eds) Language Across Boundaries, London: Continuum Press

Search the English BASE corpus

Sketch Engine offers a range of tools to work with the British Academic Spoken English corpus.
