ukWaC – British Web corpus from the .uk domain

The British Web (ukWaC) is an English corpus collected from the .uk domain using medium-frequency words from the British National Corpus as seed words. Given these two facts, it is fair to argue that it is a corpus of mainly British English, although other varieties are likely to be included wherever they appeared on a .uk domain.

The corpus was prepared by Adriano Ferraresi. The word sketches, which enable users to explore the grammatical relations of words, were prepared by David Tugwell. The full preparation of the corpus is described in Introducing and evaluating ukWaC, a very large web-derived corpus of English (LREC conference, 2008; crawled from Webarchive).

Sketch Engine provides access to the version of ukWaC tagged with SuperSenseTagger (sst-light) described in Ciaramita and Altun (2006).

Part-of-speech tagset

The corpus was part-of-speech tagged and lemmatized using TreeTagger, a part-of-speech tagger that has been trained for a number of languages. The tags follow the Penn Treebank tagset.

A complete set of tools is available for working with the British Web 2007 corpus. The tools generate:

word sketch – English collocations categorized by grammatical relations
thesaurus – synonyms and similar words for every word
keywords – terminology extraction of one-word and multi-word units
word lists – lists of English nouns, verbs, adjectives etc. organized by frequency
n-grams – frequency list of multi-word units
concordance – examples in context
text type analysis – statistics of metadata in the corpus
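
To make two of the outputs above more concrete, the following is a minimal Python sketch of an n-gram frequency list and a keyword-in-context concordance. It is only an illustration under simple assumptions: the function names ngram_frequencies and concordance are hypothetical, the sample sentence is invented rather than taken from ukWaC, tokens are split on whitespace, and none of this reflects how Sketch Engine computes its results.

```python
from collections import Counter


def ngram_frequencies(tokens, n=2):
    """Count every contiguous run of n tokens (hypothetical helper)."""
    # Zip n shifted copies of the token list to form the n-grams.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams)


def concordance(tokens, keyword, width=3):
    """Return (left context, hit, right context) for each match (hypothetical helper)."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Invented toy sample, not taken from ukWaC; whitespace tokenization only.
sample = "the cat sat on the mat and the cat slept".split()

bigrams = ngram_frequencies(sample, 2)
hits = concordance(sample, "cat", width=2)

# Print the bigrams from most to least frequent, then the concordance lines.
for gram, count in bigrams.most_common():
    print(" ".join(gram), count)
for left, hit, right in hits:
    print(f"{left:>10} [{hit}] {right}")
```

A real corpus query would run over tokenized, tagged, and lemmatized text rather than a whitespace-split string, so the counts here only show the shape of the output.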

Bibliography

BARONI, Marco, et al. The WaCky wide web: a collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation, 2009, 43.3: 209–226.

FERRARESI, Adriano, et al. Introducing and evaluating ukWaC, a very large web-derived corpus of English [crawled from Webarchive]. In Proceedings of the 4th Web as Corpus Workshop (WAC-4): Can we beat Google? 2008, pp. 47–54.

CIARAMITA, Massimiliano; ALTUN, Yasemin. Broad-coverage sense disambiguation and information extraction with a supersense sequence tagger. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2006, pp. 594–602.