Error corpus from English Wikipedia
The Error corpus from English Wikipedia is an error corpus made up of a sample of texts collected from the English Wikipedia. It is a sample of only 1 million words, created in March 2017.
Types of errors
The automatic tool marks six types of errors in texts:
- lexicosemantic – lexico-semantic errors, e.g.
<lexicosemantic> mentor | a mentor </lexicosemantic>
- punct – mistakes in punctuation, e.g.
<punct> ! | . </punct>
- spelling – misspelling, e.g.
<spelling> | intensely </spelling>
- style – errors relating to style, e.g.
<style> other | the feasibility of other </style>
- typographical – mistakes relating to typography, e.g.
<typographical> ' | " </typographical>
- unclassified – other types of errors, e.g.
<unclassified> is to be | was </unclassified>
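The examples above share one shape: an element named after the error type, containing two text segments separated by a vertical bar. A minimal Python sketch for pulling such annotations out of a text could look like the following. It assumes the annotations appear inline exactly as in the examples and that the left segment is the text as found while the right segment is the corrected form; the source does not state this explicitly. The names `Annotation` and `parse_annotations` are hypothetical and not part of any corpus tool.

```python
import re
from typing import List, NamedTuple

# The six error types listed above.
ERROR_TYPES = (
    "lexicosemantic",
    "punct",
    "spelling",
    "style",
    "typographical",
    "unclassified",
)


class Annotation(NamedTuple):
    error_type: str
    left: str   # segment before "|" (assumed: the text as found)
    right: str  # segment after "|" (assumed: the corrected form)


# Match <type> ... </type> for the known types only; the closing tag
# must repeat the opening tag's name.
_PATTERN = re.compile(
    r"<(?P<tag>" + "|".join(ERROR_TYPES) + r")>(?P<body>.*?)</(?P=tag)>",
    re.DOTALL,
)


def parse_annotations(text: str) -> List[Annotation]:
    """Return every well-formed error annotation found in text."""
    annotations = []
    for match in _PATTERN.finditer(text):
        # Split on the first "|" only; skip elements that lack one.
        left, sep, right = match.group("body").partition("|")
        if not sep:
            continue
        annotations.append(
            Annotation(match.group("tag"), left.strip(), right.strip())
        )
    return annotations
```

For instance, calling `parse_annotations` on a line containing `<punct> ! | . </punct>` yields one annotation of type `punct` with `!` on the left and `.` on the right. Elements without a vertical bar are skipped rather than guessed at.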
This English error corpus was tagged with TreeTagger using the Penn Treebank tagset with Sketch Engine modifications.