Error corpus from English Wikipedia
The Error corpus from English Wikipedia is an error corpus made up of a sample of texts collected from the English Wikipedia. The sample contains only 1 million words and was created in March 2017.
Types of errors
The automatic tool marks six types of errors in the texts:
- lexicosemantic – lexico-semantic errors
- punct – mistakes in punctuation
- spelling – misspellings
- style – errors relating to style
- typographical – mistakes relating to typography
- unclassified – other types of errors
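The six labels above can be treated as a small fixed vocabulary. The following is a minimal sketch in Python of how one might tally them; it assumes the labels arrive as a plain list of strings. The names `ERROR_TYPES` and `count_error_types` and the sample input are hypothetical, and the way the labels are actually stored in the corpus is not described here.

```python
from collections import Counter

# The six error labels listed above, with short descriptions.
ERROR_TYPES = {
    "lexicosemantic": "lexico-semantic errors",
    "punct": "mistakes in punctuation",
    "spelling": "misspellings",
    "style": "errors relating to style",
    "typographical": "mistakes relating to typography",
    "unclassified": "other types of errors",
}


def count_error_types(labels):
    """Return a count for each of the six labels; reject unknown labels."""
    unknown = set(labels).difference(ERROR_TYPES)
    if unknown:
        raise ValueError(f"unknown error labels: {sorted(unknown)}")
    counts = Counter(labels)
    # Report every label, including those that never occur.
    return {name: counts.get(name, 0) for name in ERROR_TYPES}


# Hypothetical labels for illustration only (not taken from the corpus).
sample = ["spelling", "punct", "spelling", "style"]
print(count_error_types(sample))
```

Reporting all six labels, including those with a zero count, keeps the output shape stable across samples, which makes results from different texts easier to compare.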
This English error corpus was tagged by TreeTagger using the Penn Treebank tagset with Sketch Engine modifications.