Open American National Corpus (OANC)

The OANC-MASC Corpus

The Open American National Corpus (OANC) and its subcorpus, the Manually Annotated Sub-Corpus (MASC), are text corpora of American English. The corpus includes texts of all genres and transcripts of spoken data produced from 1990 onward. The whole corpus comprises 11 million words.

The MASC subcorpus consists of 480k words with manually validated annotations for sentence boundaries, tokens, lemmas, part-of-speech (POS) tags, noun and verb chunks, named entities (person, location, organization, date), coreference, and discourse structure.

The OANC-MASC corpus merges the data from the OANC and MASC corpora. Because MASC is a subcorpus of OANC, the MASC portion of OANC was replaced by the MASC data in the merged corpus, so that no document appears twice.
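
The replacement step described above can be illustrated with a minimal sketch. It assumes each corpus is available as a mapping from a document ID to its content; the function name, the IDs, and the dictionary layout are hypothetical and are not taken from the actual corpus build.

```python
def merge_oanc_masc(oanc_docs, masc_docs):
    """Merge two {doc_id: content} mappings (hypothetical layout).

    Documents present in both corpora are taken from MASC, so each
    document appears exactly once in the result.
    """
    # Keep only the OANC documents that have no MASC counterpart.
    merged = {
        doc_id: content
        for doc_id, content in oanc_docs.items()
        if doc_id not in masc_docs
    }
    # Add every MASC document, including those that replace OANC versions.
    merged.update(masc_docs)
    return merged


# Toy data: "doc2" exists in both corpora.
oanc = {"doc1": "raw text one", "doc2": "raw text two", "doc3": "raw text three"}
masc = {"doc2": "annotated text two"}
merged = merge_oanc_masc(oanc, masc)
print(sorted(merged))
```

In this toy run the merged mapping still holds three documents, and the shared document carries the MASC version.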

The OANC-MASC corpus has two separate parts: OANC-MASC Written and OANC-MASC Spoken.

For more information, visit http://www.anc.org

Part-of-speech tagset

This OANC corpus is tagged with the TreeTagger tool using the Penn Treebank tagset with Sketch Engine modifications.

Available tools

A complete set of tools is available to work with this English corpus to generate:

word sketch – English collocations categorized by grammatical relations
thesaurus – synonyms and similar words for every word
word lists – lists of English nouns, verbs, adjectives etc. organized by frequency
n-grams – frequency list of multi-word units
concordance – examples in context
keywords – terminology extraction of one-word units
text type analysis – statistics of metadata in the corpus
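
To make two of the outputs listed above more concrete, the following minimal sketch shows how a frequency-ordered word list and an n-gram frequency list could be computed from a tokenized text. It is only an illustration using Python's standard library, not the implementation behind the tools listed here; the function name and the sample sentence are made up for the example.

```python
from collections import Counter


def ngram_frequencies(tokens, n):
    """Count every contiguous sequence of n tokens."""
    return Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )


# Toy input; a real corpus would supply tokens from its own tokenization.
tokens = "the cat sat on the mat and the cat slept".split()

# Word list: single tokens ordered by frequency.
word_list = Counter(tokens)

# N-grams: here, two-word units (bigrams).
bigrams = ngram_frequencies(tokens, 2)

print(word_list.most_common(2))
print(bigrams.most_common(1))
```

Calling `most_common()` on either counter returns the items ordered by descending frequency, which is the shape of a frequency list.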

Bibliography

Open American National Corpus (OANC)

Ide, N. (2008). The American National Corpus: Then, Now, and Tomorrow. In Michael Haugh, Kate Burridge, Jean Mulder and Pam Peters (eds.), Selected Proceedings of the 2008 HCSNet Workshop on Designing the Australian National Corpus: Mustering Languages, Cascadilla Proceedings Project, Somerville, MA.

The Manually Annotated Sub-Corpus (MASC)

Ide, N., Baker, C., Fellbaum, C., Fillmore, C., Passonneau, R. (2008). MASC: The Manually Annotated Sub-Corpus of American English. Proceedings of the Sixth Language Resources and Evaluation Conference (LREC), Marrakech, Morocco.