What is the British National Corpus?

The British National Corpus (BNC) is a 100-million-word collection of samples of written and spoken British English from the later part of the 20th century.

The British National Corpus consists of a larger written part (90%, e.g. newspapers, academic books, letters, essays, etc.) and a smaller spoken part (the remaining 10%, e.g. informal conversations, radio shows, etc.). The spoken part is also available as audio, which can be played directly in the Sketch Engine interface.

The corpus texts carry a large amount of metadata, so users can restrict searches by many criteria, such as the time of publication, the region where a spoken text was captured, the type of medium, the text domain, or David Lee’s classification, a detailed genre specification. The full list of genres of this classification is here.

The official website: http://www.natcorp.ox.ac.uk

BNC corpus in detail

Corpus sizes

Tokens 112,338,376
Words 96,132,981
Sentences 6,052,128
Documents 4,054
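
The size figures above can be related to each other with simple arithmetic. The sketch below is only an illustration: the dictionary and the function name are made up for this example, and the derived averages are rough indicators, assuming the usual Sketch Engine distinction in which tokens include punctuation and words do not.

```python
# Corpus size figures copied from the "Corpus sizes" list above.
SIZES = {
    "tokens": 112_338_376,
    "words": 96_132_981,
    "sentences": 6_052_128,
    "documents": 4_054,
}


def ratio(numerator: str, denominator: str) -> float:
    """Return SIZES[numerator] / SIZES[denominator] (illustrative helper)."""
    return SIZES[numerator] / SIZES[denominator]


if __name__ == "__main__":
    # Average sentence length in words.
    print(f"words per sentence: {ratio('words', 'sentences'):.1f}")
    # Average document length in words.
    print(f"words per document: {ratio('words', 'documents'):,.0f}")
    # How many tokens there are for each word (punctuation adds tokens).
    print(f"tokens per word:    {ratio('tokens', 'words'):.3f}")
```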

Tools to work with British National Corpus

A complete set of tools is available to work with the British National Corpus to generate:

  • word sketch – English collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • keywords – terminology extraction of one-word and multi-word units
  • word lists – lists of English nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
  • trends – diachronic analysis automatically identifies neologisms and changes in use
  • text type analysis – statistics of metadata in the corpus
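
To make two of the listed outputs more concrete, the following minimal sketch shows what an n-gram frequency list and a concordance (keyword in context) are in principle. It is plain Python over a toy token list, not the way Sketch Engine computes these results; the function names and the sample sentence are assumptions made for illustration.

```python
from collections import Counter


def ngram_frequencies(tokens, n):
    """Count every run of n consecutive tokens."""
    # zip over n shifted copies of the list yields all n-grams in order.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams)


def concordance(tokens, keyword, width=3):
    """Return (left context, hit, right context) for each match of keyword."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


if __name__ == "__main__":
    sample = "the cat sat on the mat and the cat slept".split()
    # Most frequent bigrams first.
    for gram, freq in ngram_frequencies(sample, 2).most_common(3):
        print(freq, " ".join(gram))
    # Every occurrence of "cat" with two tokens of context on each side.
    for left, hit, right in concordance(sample, "cat", width=2):
        print(f"{left:>12} [{hit}] {right}")
```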

Part-of-speech tagsets and lemmatization

The BNC corpus is part-of-speech tagged with the Penn TreeBank tagset (with Sketch Engine modifications), which indicates the part of speech and grammatical category of each token. The corpus is also lemmatized, i.e. each word form in the corpus is assigned to its base form (lemma). Sketch Engine also provides access to a version of the BNC corpus tagged with the CLAWS POS tagset, which contains these specific attributes:

  • ambtag: the ambivalent part of speech tag (all tags before disambiguation)
  • pos: one-letter abbreviation of the part of speech (the second part of lempos)
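
As an illustration of how the pos attribute relates to lempos, the sketch below splits a lempos value into its lemma and its one-letter part of speech. It assumes that the two parts are joined by a hyphen (for example "house-n"); the function name is hypothetical and the example values are not taken from the corpus.

```python
from typing import Tuple


def split_lempos(lempos: str) -> Tuple[str, str]:
    """Split a lempos value into (lemma, one-letter pos).

    Assumes the form "<lemma>-<pos>". The split is made at the LAST
    hyphen so that hyphenated lemmas such as "well-being" stay intact.
    """
    lemma, sep, pos = lempos.rpartition("-")
    if not sep or len(pos) != 1:
        raise ValueError(f"not a lempos value: {lempos!r}")
    return lemma, pos


if __name__ == "__main__":
    for value in ("house-n", "run-v", "well-being-n"):
        print(value, "->", split_lempos(value))
```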

version 3.1 (September 2023)

  • fixed encoding – characters such as ěščřžý can now be searched for
  • compiled text types for word sketch results
  • reprocessed with the TreeTagger pipeline version 3.1

version 2.2.1 (5th April 2017)

  • retagged with the TreeTagger pipeline version 2.1

version 2.2 (1st February 2017)

  • retagged with the TreeTagger pipeline version 2

version 2.0 (8th November 2010)

  • replaced SGML entities (such as &quot;) with the corresponding Unicode characters
  • added tags (spoken texts)
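
The idea behind replacing entities with Unicode characters can be sketched with Python's standard library. This is not the pipeline used to build the corpus: html.unescape only knows the HTML entity names, whereas the original SGML edition of the BNC may use entity names of its own that would need a separate mapping. The function name is an assumption for this example.

```python
import html


def decode_entities(text: str) -> str:
    """Replace named and numeric character entities with Unicode characters.

    Uses the HTML entity table from the standard library; entity names
    outside that table are left unchanged.
    """
    return html.unescape(text)


if __name__ == "__main__":
    print(decode_entities("&quot;Hello&quot; &amp; goodbye"))
    print(decode_entities("caf&eacute;"))
```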

How to reference the British National Corpus

  • The British National Corpus, version 3 (BNC XML Edition). 2007. Distributed by Oxford University Computing Services on behalf of the BNC Consortium. URL: http://www.natcorp.ox.ac.uk/
  • Reference Guide for the British National Corpus (XML Edition) edited by Lou Burnard, February 2007. URL: http://www.natcorp.ox.ac.uk/XMLedition/URG/
  • The British National Corpus, version 2 (BNC World). 2001. Distributed by Oxford University Computing Services on behalf of the BNC Consortium. URL: http://www.natcorp.ox.ac.uk/
  • The British National Corpus Users Reference Guide edited by Lou Burnard, October 2000. URL: http://www.natcorp.ox.ac.uk/archive/index.xml
  • The BNC Baby, version 2. 2005. Distributed by Oxford University Computing Services on behalf of the BNC Consortium. URL: http://www.natcorp.ox.ac.uk/
  • The BNC Sampler, XML version. 2005. Distributed by Oxford University Computing Services on behalf of the BNC Consortium. URL: http://www.natcorp.ox.ac.uk/

Data from the BNC

Our policy is to request that citations from the British National Corpus should include the text identifier (a 3-letter code) and sentence number. A suitable form of words for crediting the BNC would be:

  • “Examples of usage taken from the British National Corpus (BNC) were obtained under the terms of the BNC End User Licence. Copyright in the individual texts cited resides with the original IPR holders. For information and licensing conditions relating to the BNC, please see the website at http://www.natcorp.ox.ac.uk”
  • or: “Data cited herein have been extracted from the British National Corpus, distributed by Oxford University Computing Services on behalf of the BNC Consortium. All rights in the texts cited are reserved.”
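
The citation policy above asks for a text identifier and a sentence number with each cited example. The helper below is a hypothetical sketch of how such a reference could be assembled and checked. It assumes that a text identifier is exactly three characters, each a letter or a digit, and the output layout "BNC <id> <sentence>" is an illustrative choice, not a format prescribed by the BNC Consortium.

```python
import re

# Assumption: identifiers are three characters, letters or digits.
TEXT_ID = re.compile(r"[A-Z0-9]{3}")


def format_bnc_citation(text_id: str, sentence: int) -> str:
    """Build a short reference from a text identifier and sentence number."""
    text_id = text_id.upper()
    if not TEXT_ID.fullmatch(text_id):
        raise ValueError(f"expected a three-character identifier, got {text_id!r}")
    if sentence < 1:
        raise ValueError("sentence numbers start at 1")
    # Hypothetical layout; adapt to the citation style you are using.
    return f"BNC {text_id} {sentence}"


if __name__ == "__main__":
    # "a0b" and 12 are made-up values used only to show the output shape.
    print(format_bnc_citation("a0b", 12))
```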

Search the British National Corpus

Sketch Engine offers a range of tools to work with this British English Corpus.

Other English corpora

Explore our largest Timestamped English corpus with 70+ billion words.
