BASE: British Academic Spoken English corpus

The British Academic Spoken English (BASE) corpus is a text corpus developed at the University of Warwick and the University of Reading in a project that ran from 2000 to 2005 under the directorship of Hilary Nesi (Warwick) and Paul Thompson (Reading). The corpus consists of 160 lectures and 38 seminars* (video-recorded at the University of Warwick and audio-recorded at the University of Reading) with a total size of 1.75 million tokens. The lectures and seminars have been transcribed and annotated in accordance with the TEI Guidelines. Each transcription belongs to one of four main academic divisions, each of which is represented by 40 lectures and 10 seminars:

  • Arts and Humanities
  • Life Sciences
  • Physical Sciences
  • Social Sciences

Further information can be found on the Coventry University or University of Warwick pages.

*The original corpus contains 39 seminar transcripts. However, at the request of the corpus author, we have removed the file "lssem003", which overlaps with some of the lecture data.

Part-of-speech tagset

The corpus was part-of-speech tagged with TreeTagger, using the following tagset.

Tools to work with the British Academic Spoken English corpus

A complete set of Sketch Engine tools is available to work with the British Academic Spoken English corpus. They can be used to generate:

  • word sketch – English collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • keywords – terminology extraction of one-word and multi-word units
  • word lists – lists of English nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
  • text type analysis – statistics of metadata in the corpus

The British Academic Spoken English (BASE) corpus is freely available to researchers who agree to the following conditions:

  1. Corpus holdings should not be reproduced in full for a wider audience or readership (i.e. for publication or for teaching purposes), although researchers are free to quote short passages of text of up to 100 running words, with a total of 200 running words from any given assignment.
  2. No part of the corpus holdings should be reproduced in teaching materials intended for publication (in print or via the internet).
  3. The corpus developers should be informed of all presentations and publications arising from analysis of the corpus.

Researchers must acknowledge their use of the BASE corpus using the following form of words: The recordings and transcriptions used in this study come from the British Academic Spoken English (BASE) corpus. The corpus was developed at the Universities of Warwick and Reading under the directorship of Hilary Nesi and Paul Thompson. Corpus development was assisted by funding from BALEAP, EURALEX, the British Academy and the Arts and Humanities Research Council.

Files should be referred to by their letter and number codes, indicating disciplinary grouping (e.g. ah = arts and humanities), type of speech event (e.g. lct = lecture) and file number.
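The file-code convention described above can be illustrated with a small parser. This is only a sketch under stated assumptions: the text confirms "ah" (arts and humanities) and "lct" (lecture); "ls" and "sem" are inferred from the file name "lssem003" mentioned earlier; "ps" and "ss" are assumed by analogy with the division names; and the three-digit file number is likewise inferred from "lssem003". The names parse_file_code and BaseFile are hypothetical and are not part of any BASE or Sketch Engine tooling.

```python
import re
from typing import NamedTuple

# Division codes: "ah" is given in the text; "ls" is inferred from "lssem003";
# "ps" and "ss" are ASSUMED by analogy with the division names.
DIVISIONS = {
    "ah": "Arts and Humanities",
    "ls": "Life Sciences",
    "ps": "Physical Sciences",  # assumed code
    "ss": "Social Sciences",    # assumed code
}

# Speech-event codes: "lct" is given in the text; "sem" is inferred from "lssem003".
EVENTS = {"lct": "lecture", "sem": "seminar"}

# Assumed layout: two letters (division), three letters (event), three digits (number).
FILE_CODE = re.compile(r"^(?P<division>[a-z]{2})(?P<event>[a-z]{3})(?P<number>\d{3})$")


class BaseFile(NamedTuple):
    division: str
    event: str
    number: int


def parse_file_code(code: str) -> BaseFile:
    """Split a code such as 'ahlct001' into division, event type and file number."""
    match = FILE_CODE.match(code.strip().lower())
    if match is None:
        raise ValueError(f"not a recognisable file code: {code!r}")
    division = DIVISIONS.get(match["division"])
    event = EVENTS.get(match["event"])
    if division is None or event is None:
        raise ValueError(f"unknown division or event code in: {code!r}")
    return BaseFile(division, event, int(match["number"]))
```

For example, parse_file_code("ahlct001") would yield the division "Arts and Humanities", the event type "lecture" and the file number 1. Codes that do not match the assumed layout raise a ValueError rather than being guessed at.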


Nesi, H. and Basturkmen, H. (2006) ‘Lexical bundles and discourse signalling in academic lectures’. International Journal of Corpus Linguistics 11(3), pp. 147–168.

Thompson, P. (2006) ‘A corpus perspective on the lexis of lectures, with a focus on Economics lectures’, in Hyland, K. and Bondi, M. (eds) Academic Discourse Across Disciplines. Bern: Peter Lang, pp. 253–270.

Nesi, H. (2002) ‘An English spoken academic word list’, in Braasch, A. and Povlsen, C. (eds) Proceedings of the Tenth EURALEX International Congress. Copenhagen: Center for Sprogteknologi.

Nesi, H. (2001) ‘A corpus based analysis of academic lectures across disciplines’, in Cotterill, J. and Ife, A. (eds) Language Across Boundaries. London: Continuum Press.

