Error corpus from English Wikipedia
The Error corpus from English Wikipedia is an error corpus built from a sample of texts collected from the English Wikipedia. The current version is a 1-million-word sample created in March 2017.
Types of errors
The automatic tool marks six types of errors in texts:
Code | Description | Example |
lexicosemantic | lexico-semantic errors | <lexicosemantic> mentor | a mentor </lexicosemantic> |
punct | punctuation errors | <punct> ! | . </punct> |
spelling | misspelling | <spelling> intensly | intensely </spelling> |
style | errors relating to style | <style> other | the feasibility of other </style> |
typographical | mistakes relating to typography | <typographical> ‘ | ” </typographical> |
unclassified | other types of errors | <unclassified> is to be | was </unclassified> |
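The annotation pattern in the Example column can be read programmatically. The following is a minimal sketch, not the tool used to build the corpus. It assumes that each annotation is written inline as `<code> original | correction </code>`, with the erroneous form on the left of the pipe, as the examples in the table suggest; the corpus's actual storage format may differ. The names `ErrorAnnotation` and `extract_errors` are hypothetical.

```python
import re
from typing import List, NamedTuple

# The six error codes listed in the table above.
ERROR_TYPES = (
    "lexicosemantic", "punct", "spelling",
    "style", "typographical", "unclassified",
)


class ErrorAnnotation(NamedTuple):
    code: str        # error type, e.g. "spelling"
    original: str    # text as found (assumed: left of the pipe)
    correction: str  # corrected text (assumed: right of the pipe)


# Matches <code> ... </code> where the closing tag repeats the opening code.
_PATTERN = re.compile(
    r"<(?P<code>" + "|".join(ERROR_TYPES) + r")>(?P<body>.*?)</(?P=code)>",
    re.DOTALL,
)


def extract_errors(text: str) -> List[ErrorAnnotation]:
    """Return every inline error annotation found in text."""
    annotations = []
    for match in _PATTERN.finditer(text):
        original, sep, correction = match.group("body").partition("|")
        if not sep:
            # No pipe inside the tag: not the assumed format, so skip it.
            continue
        annotations.append(
            ErrorAnnotation(match.group("code"), original.strip(), correction.strip())
        )
    return annotations


sample = (
    "She worked <spelling> intensly | intensely </spelling> as "
    "<lexicosemantic> mentor | a mentor </lexicosemantic> <punct> ! | . </punct>"
)
for annotation in extract_errors(sample):
    print(annotation.code, repr(annotation.original), "->", repr(annotation.correction))
```

Splitting on the first pipe only keeps a correction intact if it happens to contain a pipe character itself.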
Part-of-speech tagset
This English error corpus was tagged by TreeTagger using the Penn Treebank tagset with Sketch Engine modifications.
Tools to work with the error corpus
A complete set of tools is available to work with this English error corpus to generate:
- error tagging – errors marked by the type of error (spelling, typography, etc.)
- word sketch – English collocations categorized by grammatical relations
- thesaurus – synonyms and similar words for every word
- word lists – lists of English nouns, verbs, adjectives etc. organized by frequency
- n-grams – frequency list of multi-word units
- concordance – examples in context
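To make the n-gram entry in the list above concrete, here is a minimal sketch of an n-gram frequency list built with the Python standard library. It only illustrates the general idea and is not Sketch Engine's implementation; the function name `ngram_frequencies` and the whitespace tokenization are assumptions made for the example.

```python
from collections import Counter
from typing import Counter as CounterType, Sequence, Tuple


def ngram_frequencies(tokens: Sequence[str], n: int) -> CounterType[Tuple[str, ...]]:
    """Count every run of n consecutive tokens."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # zip over n shifted views of the token list yields each n-gram as a tuple.
    return Counter(zip(*(tokens[i:] for i in range(n))))


# Naive whitespace tokenization, used here only for illustration.
tokens = "the error corpus is an error corpus".split()
bigrams = ngram_frequencies(tokens, 2)
print(bigrams.most_common(1))  # [(('error', 'corpus'), 2)]
```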
Changelog
initial version – sample (March 2017)
- 1-million-word sample from English Wikipedia
Bibliography
KLETEČKA, Jiří. Wikipedia Learner’s Corpus [online]. Brno, 2017 [cit. 2018-03-07]. Available from:
Search the error corpus
Sketch Engine offers a range of tools to work with this error corpus from the English Wikipedia.