British Academic Spoken English Corpus (BASE)

BASE: British Academic Spoken English corpus

The British Academic Spoken English (BASE) corpus is a text corpus developed at the University of Warwick and the University of Reading in a project that ran from 2000 to 2005 under the directorship of Hilary Nesi (Warwick) and Paul Thompson (Reading). The corpus consists of 160 lectures and 38 seminars* (video-recorded at the University of Warwick and audio-recorded at the University of Reading) with a total size of 1.75 million tokens. The lectures and seminars have been transcribed and annotated in accordance with the TEI Guidelines. Each transcription belongs to one of four main academic divisions, each of which is represented by 40 lectures and 10 seminars:

Arts and Humanities
Life Sciences
Physical Sciences
Social Sciences

Further information can be found on the Coventry University and University of Warwick pages.

*In fact, the original corpus contains 39 seminar transcripts. However, the author of the corpus asked us to remove the file "lssem003", which overlaps with some of the lecture data.

Part-of-speech tagset

The corpus was POS-tagged with TreeTagger, using the following tagset.

Tools to work with the British Academic Spoken English corpus

A complete set of Sketch Engine tools is available to work with the British Academic Spoken English Corpus to generate:

word sketch – English collocations categorized by grammatical relations
thesaurus – synonyms and similar words for every word
keywords – terminology extraction of one-word and multi-word units
word lists – lists of English nouns, verbs, adjectives etc. organized by frequency
n-grams – frequency list of multi-word units
concordance – examples in context
text type analysis – statistics of metadata in the corpus
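
To make the n-gram entry in the list above concrete, here is a minimal sketch of how a frequency list of multi-word units can be computed from a token sequence. It is an illustration only and not how Sketch Engine implements the tool; the function name `ngram_frequencies` and the sample sentence are hypothetical and not taken from the corpus.

```python
from collections import Counter


def ngram_frequencies(tokens, n):
    """Count every run of n consecutive tokens in the sequence."""
    # Zip n staggered copies of the token list to get sliding windows.
    windows = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(window) for window in windows)


# Hypothetical sample text, not a quotation from the BASE corpus.
tokens = "you can see that you can do this".split()
bigrams = ngram_frequencies(tokens, 2)

# Print the bigrams from most to least frequent.
for gram, count in bigrams.most_common():
    print(count, gram)
```

In this sample the bigram "you can" occurs twice and every other bigram once, so it heads the list.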

Bibliographic references

The British Academic Spoken English (BASE) corpus is freely available to researchers who agree to the following conditions:

Corpus holdings should not be reproduced in full for a wider audience/readership (i.e. for publication or for teaching purposes), although researchers are free to quote short passages of text up to 100 running words, with a total of 200 running words from any given assignment.
No part of the corpus holdings should be reproduced in teaching materials intended for publication (in print or via the internet).
The corpus developers should be informed of all presentations and publications arising from analysis of the corpus.

Researchers must acknowledge their use of the BASE corpus using the following form of words: The recordings and transcriptions used in this study come from the British Academic Spoken English (BASE) corpus. The corpus was developed at the Universities of Warwick and Reading under the directorship of Hilary Nesi and Paul Thompson. Corpus development was assisted by funding from BALEAP, EURALEX, the British Academy and the Arts and Humanities Research Council.

Files should be referred to by their letter and number codes, indicating disciplinary grouping (e.g. ah = arts and humanities), type of speech event (e.g. lct = lecture) and file number.
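
The code scheme described above can be sketched as a small parser. Only "ah" (arts and humanities) and "lct" (lecture) are spelled out in the text; the other mappings, the two-letter plus three-letter plus digits layout, and the names `parse_file_code` and `FileCode` are assumptions inferred from the division names and from the file name "lssem003" mentioned earlier.

```python
import re
from typing import NamedTuple

# Assumed mappings: only "ah" and "lct" are confirmed by the text above.
DIVISIONS = {
    "ah": "Arts and Humanities",
    "ls": "Life Sciences",
    "ps": "Physical Sciences",
    "ss": "Social Sciences",
}
EVENT_TYPES = {"lct": "lecture", "sem": "seminar"}

# Assumed layout: two letters, three letters, then the file number.
CODE_PATTERN = re.compile(r"^([a-z]{2})([a-z]{3})(\d+)$")


class FileCode(NamedTuple):
    division: str
    event_type: str
    number: int


def parse_file_code(code: str) -> FileCode:
    """Split a code such as 'lssem003' into its three parts."""
    match = CODE_PATTERN.match(code.strip().lower())
    if match is None:
        raise ValueError(f"unrecognised file code: {code!r}")
    division, event, number = match.groups()
    if division not in DIVISIONS or event not in EVENT_TYPES:
        raise ValueError(f"unknown division or event type in {code!r}")
    return FileCode(DIVISIONS[division], EVENT_TYPES[event], int(number))


print(parse_file_code("lssem003"))
# FileCode(division='Life Sciences', event_type='seminar', number=3)
```

Codes that do not fit the assumed layout raise a `ValueError` rather than being guessed at.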

Nesi, H. and H. Basturkmen (2006) ‘Lexical bundles and discourse signalling in academic lectures’. International Journal of Corpus Linguistics 11(3) 147-168

Thompson, P. (2006) ‘A corpus perspective on the lexis of lectures, with a focus on Economics lectures’. In K. Hyland and M. Bondi (eds) Academic Discourse Across Disciplines Bern: Peter Lang, pp. 253-270

Nesi, H. (2002) ‘An English spoken academic word list’, in Braasch, A. and Povlsen, C. (eds) Proceedings of the Tenth EURALEX International Congress, Copenhagen: Center for Sprogteknologi

Nesi, H. (2001) ‘A corpus based analysis of academic lectures across disciplines’, in Cotterill, J. and Ife, A. (eds) Language Across Boundaries, London: Continuum Press

Changelog

version 2 (January 2025)

Empty tokens between structures were deleted to make sure the surrounding tokens can be queried properly. As a result, a few empty documents were removed altogether, so the overall number of documents in this version is a bit lower than in the previous one. No real text was affected, as only empty documents were deleted.
