SemCor: semantically annotated English corpus

The SemCor corpus is an English corpus of semantically annotated texts. The semantic analysis was done manually with WordNet 1.6 senses (SemCor version 1.6) and later automatically mapped to WordNet 3.0 (SemCor version 3.0). The SemCor corpus consists of 352 texts from the Brown Corpus.

The sense-tagged corpus SemCor 3.0 was created automatically from SemCor 1.6 by mapping WordNet 1.6 senses to WordNet 3.0 senses. SemCor 1.6 was created by, and is the property of, Princeton University. The automatic mapping was performed by Rada Mihalcea (rada@cs.unt.edu).

The corpus also contains multi-word expressions (MWEs), whose component words are joined with an underscore (_), e.g. manor_house. These multi-word units were annotated by Siva Reddy.
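
The underscore convention can be handled in downstream processing along the following lines. This is only a minimal sketch: the function names `split_mwe` and `is_mwe` and the sample token list are illustrative assumptions, not part of the corpus or of any Sketch Engine tool.

```python
from typing import List


def split_mwe(token: str) -> List[str]:
    """Split an underscore-joined multi-word expression into its words."""
    # Drop empty pieces so a stray leading or trailing underscore
    # does not produce empty strings.
    return [part for part in token.split("_") if part]


def is_mwe(token: str) -> bool:
    """Return True when the token consists of more than one word."""
    return len(split_mwe(token)) > 1


# Hypothetical token sequence, made up for illustration.
tokens = ["the", "manor_house", "stood", "empty"]
mwes = [t for t in tokens if is_mwe(t)]

print(mwes)                      # ['manor_house']
print(split_mwe("manor_house"))  # ['manor', 'house']
```

Filtering out empty pieces means that a token consisting only of an underscore is not reported as a multi-word expression.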

Part-of-speech tagset

SemCor was tagged by TreeTagger using the Penn Treebank tagset.

WordNet Release 1.6
Semantic Concordance Release 1.6
This software and database is being provided to you, the LICENSEE, by Princeton University under the following license. By obtaining, using and/or copying this software and database, you agree that you have read, understood, and will comply with these terms and conditions.:
Permission to use, copy, modify and distribute this software and database and its documentation for any purpose and without fee or royalty is hereby granted, provided that you agree to comply with the following copyright notice and statements, including the disclaimer, and that the same appear on ALL copies of the software, database and documentation, including modifications that you make for internal use or for distribution.
WordNet 1.6 Copyright 1997 by Princeton University. All rights reserved.
THIS SOFTWARE AND DATABASE IS PROVIDED “AS IS” AND PRINCETON UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR IMPLIED. BY WAY OF EXAMPLE, BUT NOT LIMITATION, PRINCETON UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE OR THAT THE USE OF THE LICENSED SOFTWARE, DATABASE OR DOCUMENTATION WILL NOT INFRINGE ANY THIRD PARTY PATENTS, COPYRIGHTS, TRADEMARKS OR OTHER RIGHTS.
The name of Princeton University or Princeton may not be used in advertising or publicity pertaining to distribution of the software and/or database. Title to copyright in this software, database and any associated documentation shall at all times remain with Princeton University and LICENSEE agrees to preserve same.

Tools to work with the SemCor corpus

A complete set of tools is available to work with this sense-annotated English corpus to generate:

  • keywords – terminology extraction of one-word units
  • word lists – lists of English nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
  • text type analysis – statistics of metadata in the corpus
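
To make the word list and n-gram outputs concrete, the following sketch counts word and bigram frequencies over a small token sequence. It is not how Sketch Engine computes these lists; the helper `ngrams` and the sample sentence are assumptions made up for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical sample text; multi-word expressions stay joined by "_".
tokens = "the old manor_house stood near the old mill".split()

word_freq = Counter(tokens)               # word list ordered by frequency
bigram_freq = Counter(ngrams(tokens, 2))  # frequency list of 2-grams

print(word_freq.most_common(2))    # [('the', 2), ('old', 2)]
print(bigram_freq.most_common(1))  # [(('the', 'old'), 2)]
```

Because underscore-joined expressions are treated as single tokens here, manor_house is counted as one word rather than two.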

Search the sense-tagged annotated corpus

Sketch Engine offers a range of tools to work with the SemCor corpus.

Other text corpora in Sketch Engine

Sketch Engine offers 800+ language corpora.

Use Sketch Engine in minutes

Generate collocations, frequency lists, examples in context and n-grams, or extract terms. Use our Quick Start Guide to learn how in minutes.