SemCor – sense-tagged English corpus

SemCor: semantically annotated English corpus

The SemCor corpus is an English corpus of semantically annotated texts. The semantic annotation was done manually with WordNet 1.6 senses (SemCor version 1.6) and later mapped automatically to WordNet 3.0 senses (SemCor version 3.0). The SemCor corpus consists of 352 texts from the Brown Corpus.

The sense-tagged SemCor 3.0 corpus was created automatically from SemCor 1.6 by mapping WordNet 1.6 senses to WordNet 3.0 senses. SemCor 1.6 was created by, and is the property of, Princeton University. The automatic mapping was performed by Rada Mihalcea (rada@cs.unt.edu).

The corpus also contains multi-word expressions (MWEs), whose component words are joined with an underscore (_), e.g. manor_house. These multi-word units were annotated by Siva Reddy.
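The underscore convention can be illustrated with a minimal Python sketch. It assumes the corpus text has already been exported as a plain list of tokens; the helper names split_mwe and is_mwe and the sample tokens are invented for illustration and are not part of SemCor or any tool distributed with it.

```python
def split_mwe(token: str) -> list[str]:
    """Return the component words of a token.

    A single word yields a one-item list; empty parts (e.g. from a
    stray leading or trailing underscore) are dropped.
    """
    return [part for part in token.split("_") if part]


def is_mwe(token: str) -> bool:
    """True if the token joins two or more words with underscores."""
    return len(split_mwe(token)) > 1


# Invented sample tokens, not taken from the corpus itself.
tokens = ["the", "old", "manor_house", "stood", "empty"]

# Keep only the tokens that are marked as multi-word expressions.
mwes = [t for t in tokens if is_mwe(t)]
```

Here mwes holds only manor_house, and split_mwe("manor_house") gives back the two component words.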

Part-of-speech tagset

SemCor was part-of-speech tagged by TreeTagger using the Penn Treebank tagset.

License

WordNet Release 1.6

Semantic Concordance Release 1.6

This software and database is being provided to you, the LICENSEE, by Princeton University under the following license. By obtaining, using and/or copying this software and database, you agree that you have read, understood, and will comply with these terms and conditions.:

Permission to use, copy, modify and distribute this software and database and its documentation for any purpose and without fee or royalty is hereby granted, provided that you agree to comply with the following copyright notice and statements, including the disclaimer, and that the same appear on ALL copies of the software, database and documentation, including modifications that you make for internal use or for distribution.

THIS SOFTWARE AND DATABASE IS PROVIDED “AS IS” AND PRINCETON UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR IMPLIED. BY WAY OF EXAMPLE, BUT NOT LIMITATION, PRINCETON UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE OR THAT THE USE OF THE LICENSED SOFTWARE, DATABASE OR DOCUMENTATION WILL NOT INFRINGE ANY THIRD PARTY PATENTS, COPYRIGHTS, TRADEMARKS OR OTHER RIGHTS.

The name of Princeton University or Princeton may not be used in advertising or publicity pertaining to distribution of the software and/or database. Title to copyright in this software, database and any associated documentation shall at all times remain with Princeton University and LICENSEE agrees to preserve same.

Tools to work with the SemCor corpus

A complete set of tools is available to work with this sense-annotated English corpus to generate:

keywords – terminology extraction of one-word units
word lists – lists of English nouns, verbs, adjectives etc. organized by frequency
n-grams – frequency list of multi-word units
concordance – examples in context
text type analysis – statistics of metadata in the corpus
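As a rough illustration of what a frequency-ordered word list and an n-gram frequency list contain, the following Python sketch counts tokens and bigrams in a short token list. It is not how the tools listed above are implemented; the sample tokens and the helper name ngrams are invented for the example.

```python
from collections import Counter

# Invented sample tokens; underscore-joined units count as one token.
tokens = [
    "the", "manor_house", "stood", "near", "the", "river",
    "and", "the", "manor_house", "was", "old",
]


def ngrams(seq, n):
    """Return all contiguous n-token windows of seq as tuples."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


# Word list: token frequencies, most frequent first.
word_freq = Counter(tokens)
top_words = word_freq.most_common(2)

# N-gram list: frequencies of adjacent token pairs (bigrams).
bigram_freq = Counter(ngrams(tokens, 2))
```

In this sample, "the" occurs three times and "manor_house" twice, and the bigram ("the", "manor_house") occurs twice.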