Project Gutenberg English corpus

The Project Gutenberg English corpus consists of all English e-books that were available in the Project Gutenberg database in October 2014.

  • downloaded with wget (see: getting Gutenberg)
  • cleaned with jusText (using a slightly modified algorithm)
  • title and author sometimes retrievable from HTML META tags
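As an illustration of the last point, the following is a minimal sketch of how a title and author might be read from HTML META tags using only the Python standard library. It is not the procedure used to build this corpus; the names `MetaExtractor` and `extract_title_author` are hypothetical, and the choice of the `title` and `author` META names (with a fallback to the `<title>` element) is an assumption.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collect <title> text and <meta name=... content=...> pairs."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") and attrs.get("content"):
            # Normalise META names to lowercase so "Author" and "author" match.
            self.meta[attrs["name"].lower()] = attrs["content"].strip()
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title = (self.title or "") + data


def extract_title_author(html):
    """Return (title, author); either value is None when it is not present."""
    parser = MetaExtractor()
    parser.feed(html)
    fallback_title = parser.title.strip() if parser.title else None
    title = parser.meta.get("title") or fallback_title
    author = parser.meta.get("author")
    return title, author
```

Because many pages lack these tags, a caller should expect `None` for either value, which matches the "sometimes retrievable" caveat above.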

Part-of-speech tagset

The Project Gutenberg English corpus was tagged by TreeTagger using the Penn Treebank tagset.

Tools to work with the Project Gutenberg English corpus

A complete set of tools is available for working with this corpus. The tools generate:

  • word sketch – collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • word lists – lists of nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
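To make two of these outputs concrete, here is a minimal Python sketch of an n-gram frequency list and a keyword-in-context concordance over plain text. It is only an illustration of the concepts, not how Sketch Engine computes them; the function names `tokenize`, `ngram_frequencies`, and `concordance` are hypothetical, and the regular-expression tokenizer is a crude assumption standing in for a real tokenizer.

```python
from collections import Counter
import re


def tokenize(text):
    # Crude lowercase word tokenizer (an assumption for this sketch).
    return re.findall(r"[a-z']+", text.lower())


def ngram_frequencies(tokens, n=2):
    # Slide a window of n tokens over the text and count each sequence.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


def concordance(tokens, keyword, width=3):
    # Return (left context, keyword, right context) for every occurrence.
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines
```

Calling `ngram_frequencies(tokens, 2).most_common(10)` would list the ten most frequent bigrams, and `concordance(tokens, "whale")` would list each occurrence of "whale" with up to three words of context on either side.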

Search the Project Gutenberg English corpus

Sketch Engine offers a range of tools to work with this Gutenberg corpus.

Other text corpora

Sketch Engine offers 450+ language corpora.

Use Sketch Engine in minutes

Generate collocations, frequency lists, examples in context, and n-grams, or extract terms. Use our Quick Start Guide to get started in minutes.