Santa Barbara Corpus of Spoken American English

The Santa Barbara Corpus of Spoken American English (SBCSAE) is an English corpus built from a large body of recordings of naturally occurring spoken interaction from all over the United States. The speakers represent a variety of regional origins, ages, occupations, genders, and ethnic and social backgrounds. This information is also recorded in the corpus metadata, which you can access using the Text Type Analysis function.

The corpus includes transcriptions as well as audio tracks. To play the audio, click the play button (red icon) on the right side of the concordance line. See the screenshot below:

Santa Barbara corpus (audio)

The corpus was created in the Linguistics Department of the University of California, Santa Barbara, and is distributed under the CC BY-ND 3.0 US licence.

Please refer to the official website for more information: https://www.linguistics.ucsb.edu/research/santa-barbara-corpus

Individual recordings can be found here: https://sla.talkbank.org/TBB/ca/SBCSAE/01.cha

Part-of-speech tagset and lemmatization

The English corpora are part-of-speech tagged with the English Penn Treebank tagset (with Sketch Engine modifications), which indicates the part of speech and grammatical category of each token. The corpus texts are also lemmatized, meaning that each word form in the corpus is assigned its base form (lemma).
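
To illustrate what part-of-speech tagging and lemmatization add to a text, here is a minimal Python sketch. It is not Sketch Engine code and does not use any Sketch Engine API. The sentence is invented and hand-annotated rather than taken from the corpus, the names Token, SENTENCE, lemmas, tag_counts and find_by_lemma are hypothetical, and the tags are standard Penn Treebank tags, so the Sketch Engine-modified tagset may label some tokens (verbs in particular) differently.

```python
from collections import Counter
from typing import NamedTuple


class Token(NamedTuple):
    """One annotated token: surface form, part-of-speech tag, base form."""
    word: str
    tag: str    # standard Penn Treebank tag (assumption: not the modified tagset)
    lemma: str


# Invented, hand-annotated example sentence (not from the SBCSAE).
SENTENCE = [
    Token("The", "DT", "the"),        # DT  = determiner
    Token("dogs", "NNS", "dog"),      # NNS = plural noun
    Token("ran", "VBD", "run"),       # VBD = past-tense verb
    Token("quickly", "RB", "quickly"),  # RB = adverb
]


def lemmas(tokens):
    """Return the lemma of every token, in order."""
    return [t.lemma for t in tokens]


def tag_counts(tokens):
    """Count how often each part-of-speech tag occurs."""
    return Counter(t.tag for t in tokens)


def find_by_lemma(tokens, lemma):
    """Return every word form whose lemma matches, whatever its inflection."""
    return [t.word for t in tokens if t.lemma == lemma]


if __name__ == "__main__":
    print(lemmas(SENTENCE))
    print(tag_counts(SENTENCE))
    print(find_by_lemma(SENTENCE, "run"))
```

Searching by lemma is what lets a single query for "run" also match forms such as "ran", while the tag makes it possible to restrict a search to a particular part of speech.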

Search the Santa Barbara corpus

Sketch Engine offers a range of tools to work with this English corpus.
