ukWaC – British English corpus from the .uk domain
The ukWaC is a text corpus of British English collected from the .uk domain, using medium-frequency words from the British National Corpus as seed words. Given these two facts, it is fair to argue that it is mainly a corpus of British English, although other varieties are likely to be included wherever they were found on a .uk domain.
The corpus was prepared by Adriano Ferraresi, and the word sketches, which make it possible to explore the grammatical relations of words, were prepared by David Tugwell. The full preparation of the corpus is described in "Introducing and evaluating ukWaC, a very large web-derived corpus of English" (LREC conference, 2008; crawled from Webarchive).
Sketch Engine provides access to the version of ukWaC tagged with SuperSenseTagger (sst-light) described in Ciaramita and Altun (2006).
The corpus was part-of-speech tagged and lemmatized using TreeTagger, a leading part-of-speech tagger which has been trained for a number of languages. The tagging uses the Penn Treebank tagset.
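As a rough illustration of what tagged and lemmatized text looks like, the sketch below reads a few lines in the one-token-per-line layout that TreeTagger typically produces (word, tag and lemma separated by tabs). The sample lines, the `<s>` markup and the function name `parse_vertical` are made up for this example and are not taken from the ukWaC files themselves; the actual corpus layout may differ.

```python
from collections import Counter


def parse_vertical(lines):
    """Yield (word, tag, lemma) triples from tab-separated token lines.

    Assumes one token per line in the order word, tag, lemma.
    Lines without a tab (blank lines, markup such as <s>) are skipped.
    """
    for line in lines:
        line = line.rstrip("\n")
        if "\t" not in line:
            continue  # structural markup or empty line, not a token
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # ignore lines that do not match the assumed layout
        yield tuple(parts)


# Hypothetical sample, not real corpus data.
sample = [
    "<s>",
    "The\tDT\tthe",
    "cats\tNNS\tcat",
    "sat\tVBD\tsit",
    "</s>",
]

tokens = list(parse_vertical(sample))
tag_counts = Counter(tag for _, tag, _ in tokens)
lemmas = [lemma for _, _, lemma in tokens]
```

Here `tag_counts` tallies how often each Penn Treebank tag occurs and `lemmas` holds the base forms, which is the kind of information that part-of-speech tagging and lemmatization add to the raw text.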