Error corpus from English Wikipedia

The Error corpus from English Wikipedia is an error corpus made up of a sample of texts collected from the English Wikipedia. It is only a 1-million-word sample, created in March 2017.

Types of errors

The automatic tool marks six types of errors in texts:

  • lexicosemantic – lexico-semantic errors
  • punct – mistakes in punctuation
  • spelling – misspellings
  • style – errors relating to style
  • typographical – mistakes relating to typography
  • unclassified – other types of errors
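
As a rough illustration of how these six labels might be used downstream, the sketch below tallies errors per type. The input format (a list of (text, label) pairs), the sample data, and the function name count_errors_by_type are assumptions made for this example; they do not reflect the corpus's actual annotation format or any Sketch Engine API.

```python
from collections import Counter

# The six error labels listed above.
ERROR_TYPES = (
    "lexicosemantic",
    "punct",
    "spelling",
    "style",
    "typographical",
    "unclassified",
)


def count_errors_by_type(annotations):
    """Tally (text, label) pairs per label.

    Labels outside the six known types are counted as 'unclassified'
    (an assumption of this sketch, not documented corpus behaviour).
    """
    counts = Counter({label: 0 for label in ERROR_TYPES})
    for _text, label in annotations:
        counts[label if label in ERROR_TYPES else "unclassified"] += 1
    return dict(counts)


# Hypothetical annotations, invented for illustration only.
sample = [("recieve", "spelling"), ("teh", "spelling"), ("word ,", "punct")]
print(count_errors_by_type(sample))
```

Starting every label at zero keeps absent error types visible in the result, which makes samples easier to compare.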

Part-of-speech tagset

This English error corpus was tagged by TreeTagger using the Penn Treebank tagset with Sketch Engine modifications.

Tools to work with the error corpus

A complete set of tools is available to work with this English error corpus to generate:

  • errors – errors marked by the type of error (spelling, typography, etc.)
  • word sketch – English collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • word lists – lists of English nouns, verbs, adjectives, etc. organized by frequency
  • n-grams – frequency lists of multi-word units
  • concordance – examples in context
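
To make the idea of an n-gram frequency list concrete, here is a minimal sketch that counts contiguous token sequences in a short text. It is not how Sketch Engine computes n-grams; the whitespace tokenization, the sample sentence, and the function name ngram_frequencies are assumptions chosen for illustration.

```python
from collections import Counter


def ngram_frequencies(tokens, n=2):
    """Return (n-gram, count) pairs, most frequent first, ties in alphabetical order."""
    # Zip the token list against shifted copies of itself to form n-grams.
    grams = zip(*(tokens[i:] for i in range(n)))
    counts = Counter(grams)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Naive whitespace tokenization; a real corpus would use a proper tokenizer.
tokens = "the free encyclopedia that anyone can edit the free encyclopedia".split()
for gram, count in ngram_frequencies(tokens, n=2)[:3]:
    print(" ".join(gram), count)
```

The same function yields a plain word frequency list when called with n=1.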

Changelog

initial version – sample (March 2017)

  • 1-million-word sample from English Wikipedia

Bibliography

KLETEČKA, Jiří. Wikipedia Learner’s Corpus [online]. Brno, 2017 [cit. 2018-03-07]. Available from: . Bachelor’s thesis. Masaryk University, Faculty of Informatics. Thesis supervisor Vít Baisa.

Search the error corpus

Sketch Engine offers a range of tools to work with this error corpus from the English Wikipedia.

