OANC: Open American National Corpus

The OANC-MASC Corpus

The Open American National Corpus (OANC), together with its subcorpus, the Manually Annotated Sub-Corpus (MASC), is a text corpus of American English. It includes texts of all genres and transcripts of spoken data produced from 1990 onward. The whole corpus comprises 11 million words.

The MASC subcorpus consists of 480k words with manually validated annotations for sentence boundaries, tokens, lemmas, part-of-speech (POS) tags, noun and verb chunks, named entities (person, location, organization, date), coreference, and discourse structure.

The OANC-MASC corpus contains merged data from the OANC and MASC corpora. Because MASC is a subcorpus of OANC, the MASC portion of OANC was replaced by the MASC data in the resulting OANC-MASC corpus, which removes duplicated documents.
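
The replacement step can be illustrated with a minimal sketch. It assumes each document carries an identifier shared between OANC and MASC; the identifiers, the dictionary layout, and the function name merge_corpora are hypothetical and are not taken from the actual corpus build.

```python
def merge_corpora(oanc_docs, masc_docs):
    """Merge two {doc_id: content} mappings, preferring the MASC version.

    Hypothetical helper: assumes OANC and MASC use the same document IDs
    for the documents they have in common.
    """
    merged = dict(oanc_docs)   # start from a copy of the full OANC
    merged.update(masc_docs)   # a MASC document overrides the OANC one with the same ID
    return merged


# Toy data for illustration only (not real corpus documents).
oanc = {
    "doc1": "OANC text 1",
    "doc2": "OANC text 2",
    "doc3": "OANC text 3",
}
masc = {"doc2": "MASC text 2 with manually validated annotations"}

merged = merge_corpora(oanc, masc)
```

Each identifier appears once in the merged result, so a document present in both sources is kept only in its MASC version.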

The OANC-MASC corpus has two separate parts: OANC-MASC Written and OANC-MASC Spoken.

For more information visit http://www.anc.org

Part-of-speech tagset

This OANC corpus was tagged with the TreeTagger tool using the Penn Treebank tagset with Sketch Engine modifications.

Open American National Corpus (OANC)

Ide, N. (2008). The American National Corpus: Then, Now, and Tomorrow. In Michael Haugh, Kate Burridge, Jean Mulder and Pam Peters (eds.), Selected Proceedings of the 2008 HCSNet Workshop on Designing the Australian National Corpus: Mustering Languages, Cascadilla Proceedings Project, Somerville, MA.

The Manually Annotated Sub-Corpus (MASC)

Ide, N., Baker, C., Fellbaum, C., Fillmore, C., Passonneau, R. (2008). MASC: The Manually Annotated Sub-Corpus of American English. Proceedings of the Sixth Language Resources and Evaluation Conference (LREC), Marrakech, Morocco.

