Error corpus from English Wikipedia
The Error corpus from English Wikipedia is an error corpus made up of a sample of texts collected from the English Wikipedia. It is a sample of only 1 million words, created in March 2017.
Types of errors
The automatic tool marks six types of errors in texts:
lexical and semantic errors: <lexicosemantic> mentor | a mentor </lexicosemantic>
mistakes in punctuation: <punct> ! | . </punct>
spelling mistakes: <spelling> intensly | intensely </spelling>
errors relating to style: <style> other | the feasibility of other </style>
mistakes relating to typography: <typographical> ‘ | ” </typographical>
other types of errors: <unclassified> is to be | was </unclassified>
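The annotations shown above can be pulled out of a text with a regular expression. The following is a minimal sketch, not part of the corpus tooling: it assumes that every annotation has the form `<type> original | correction </type>` (inferred from examples such as `intensly | intensely`), and the names `ERROR_TYPES`, `ANNOTATION`, and `extract_errors` are hypothetical.

```python
import re
from collections import Counter

# Tag names taken from the six examples listed above.
ERROR_TYPES = (
    "lexicosemantic", "punct", "spelling",
    "style", "typographical", "unclassified",
)

# Assumption: the text left of "|" is the original and the text right of it
# is the correction, as the spelling example suggests.
ANNOTATION = re.compile(
    r"<(?P<type>" + "|".join(ERROR_TYPES) + r")>"
    r"\s*(?P<original>.*?)\s*\|\s*(?P<correction>.*?)\s*"
    r"</(?P=type)>",
    re.DOTALL,
)

def extract_errors(text):
    """Return (type, original, correction) tuples for each annotation in text."""
    return [
        (m.group("type"), m.group("original"), m.group("correction"))
        for m in ANNOTATION.finditer(text)
    ]

# Sample built only from the annotations quoted in this document.
sample = (
    "<spelling> intensly | intensely </spelling> "
    "<punct> ! | . </punct> "
    "<unclassified> is to be | was </unclassified>"
)
errors = extract_errors(sample)
counts = Counter(error_type for error_type, _, _ in errors)
```

Here `errors` holds one tuple per annotation and `counts` tallies the annotations by error type, which is a convenient starting point for frequency comparisons across the six categories.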
This English error corpus was tagged with TreeTagger using the Penn Treebank tagset with Sketch Engine modifications.