Error corpus from English Wikipedia

The Error corpus from English Wikipedia is an error corpus built from a sample of texts collected from the English Wikipedia. The sample contains 1 million words and was created in March 2017.

Types of errors

The automatic tool marks six types of errors in texts:

Code            Description                      Example
lexicosemantic  lexico-semantic errors           <lexicosemantic> mentor | a mentor </lexicosemantic>
punct           punctuation errors               <punct> ! | . </punct>
spelling        misspellings                     <spelling> intensly | intensely </spelling>
style           errors relating to style         <style> other | the feasibility of other </style>
typographical   errors relating to typography    <typographical> ‘ | ” </typographical>
unclassified    other types of errors            <unclassified> is to be | was </unclassified>
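
The examples above suggest that each annotation wraps two variants separated by a vertical bar, with the text as written on the left and the corrected text on the right. That reading is an assumption inferred from the examples, not a documented specification. Under that assumption, a minimal Python sketch for pulling annotations out of a text could look like this (the names ERROR_CODES, ErrorAnnotation and extract_errors are hypothetical and not part of any Sketch Engine tool):

```python
import re
from typing import List, NamedTuple

# The six error codes listed in the table above.
ERROR_CODES = (
    "lexicosemantic", "punct", "spelling",
    "style", "typographical", "unclassified",
)

# Matches <code> ... </code>; the closing tag must repeat the opening code.
ANNOTATION = re.compile(
    r"<(?P<code>" + "|".join(ERROR_CODES) + r")>(?P<body>.*?)</(?P=code)>",
    re.DOTALL,
)


class ErrorAnnotation(NamedTuple):
    code: str        # error type, e.g. "spelling"
    original: str    # text as written (assumed: left of the bar)
    correction: str  # corrected text (assumed: right of the bar)


def extract_errors(text: str) -> List[ErrorAnnotation]:
    """Return all annotations found in text, in order of appearance."""
    annotations = []
    for match in ANNOTATION.finditer(text):
        # Split on the first "|" only.
        original, separator, correction = match.group("body").partition("|")
        if not separator:
            continue  # no bar: not in the assumed format, so skip it
        annotations.append(
            ErrorAnnotation(match.group("code"), original.strip(), correction.strip())
        )
    return annotations


sample = (
    "She trained <spelling> intensly | intensely </spelling> with "
    "<lexicosemantic> mentor | a mentor </lexicosemantic>."
)
for annotation in extract_errors(sample):
    print(annotation)
```

Splitting on the first bar keeps the sketch simple. An annotation whose variants themselves contain a bar would need a stricter rule.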

Part-of-speech tagset

This English error corpus was tagged by TreeTagger using the Penn Treebank tagset with Sketch Engine modifications.

Tools to work with the error corpus

A complete set of tools is available to work with this English error corpus. The tools provide:

  • error tagging – errors marked by the type of error (spelling, typography, etc.)
  • word sketch – English collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • word lists – lists of English nouns, verbs, adjectives, etc. organized by frequency
  • n-grams – frequency lists of multi-word units
  • concordance – examples in context
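
To illustrate what an n-gram frequency list contains, here is a minimal Python sketch. It is not how Sketch Engine computes its lists: it assumes plain whitespace tokenization and lowercasing, and the function name ngram_frequencies is hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def ngram_frequencies(tokens: List[str], n: int) -> List[Tuple[Tuple[str, ...], int]]:
    """Return (n-gram, count) pairs sorted from most to least frequent."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Zipping n shifted copies of the token list yields every run of n adjacent tokens.
    ngrams = zip(*(tokens[i:] for i in range(n)))
    return Counter(ngrams).most_common()


# Assumption: whitespace tokenization; a real corpus uses a proper tokenizer.
tokens = "the cat sat on the mat and the cat slept".lower().split()
for ngram, count in ngram_frequencies(tokens, 2)[:3]:
    print(" ".join(ngram), count)
```

Setting n to 1 gives a plain word frequency list, which is the idea behind the word lists mentioned above.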


initial version – sample (March 2017)

  • 1-million-word sample from English Wikipedia


KLETEČKA, Jiří. Wikipedia Learner’s Corpus [online]. Brno, 2017 [cit. 2018-03-07]. Available from: . Bachelor’s thesis. Masaryk University, Faculty of Informatics. Thesis supervisor Vít Baisa.

Search the error corpus

Sketch Engine offers a range of tools to work with this error corpus from the English Wikipedia.
