ukWaC – British Web corpus from the .uk domain

The British Web corpus (ukWaC) is an English corpus collected from the .uk domain, using medium-frequency words from the British National Corpus as seed words. Given these two facts, it is fair to describe it as a corpus of mainly British English, although other varieties are likely to be included as long as they were found on a .uk domain.

The corpus was prepared by Adriano Ferraresi. The word sketches, which enable users to explore the grammatical relations of words, were prepared by David Tugwell. The full preparation of the corpus is described in Introducing and evaluating ukWaC, a very large web-derived corpus of English (LREC conference, 2008; crawled from Webarchive).

Sketch Engine provides access to the version of ukWaC tagged with SuperSenseTagger (sst-light) described in Ciaramita and Altun (2006).

Part-of-speech tagset

The corpus was part-of-speech tagged and lemmatized using TreeTagger, a leading part-of-speech tagger that has been trained for a number of languages. The tagging uses the Penn Treebank tagset.
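
As a rough illustration of what tagged and lemmatized text makes possible, the sketch below counts lemmas by part of speech. It assumes a tab-separated layout of word, tag and lemma, one token per line, with Penn Treebank tags. The sample lines and the function names are invented for this example and are not taken from ukWaC or from Sketch Engine.

```python
from collections import Counter

# Hypothetical sample (not real ukWaC data): word<TAB>tag<TAB>lemma per line,
# using Penn Treebank tags such as NN/NNS (nouns) and VBD/VBN (verb forms).
SAMPLE = (
    "The\tDT\tthe\n"
    "corpora\tNNS\tcorpus\n"
    "were\tVBD\tbe\n"
    "crawled\tVBN\tcrawl\n"
    "from\tIN\tfrom\n"
    "the\tDT\tthe\n"
    "web\tNN\tweb\n"
    ".\t.\t.\n"
)


def parse_tagged(text):
    """Split tagged text into (word, tag, lemma) tuples, skipping blank lines."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        word, tag, lemma = line.split("\t")
        rows.append((word, tag, lemma))
    return rows


def lemma_frequencies(rows, tag_prefix):
    """Count lemmas whose tag starts with the given prefix (e.g. 'NN' for nouns)."""
    return Counter(lemma for _, tag, lemma in rows if tag.startswith(tag_prefix))


rows = parse_tagged(SAMPLE)
noun_counts = lemma_frequencies(rows, "NN")  # lemmas tagged NN, NNS, ...
verb_counts = lemma_frequencies(rows, "VB")  # lemmas tagged VB, VBD, VBN, ...
print(noun_counts.most_common())
print(verb_counts.most_common())
```

Matching on a tag prefix is a simplification: it groups all noun or verb tags together, which is convenient for frequency lists but discards distinctions such as singular versus plural.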

A complete set of tools is available for working with the British Web 2007 corpus. These tools generate:

  • word sketch – English collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • keywords – terminology extraction of one-word and multi-word units
  • word lists – lists of English nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
  • text type analysis – statistics of metadata in the corpus
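
To make a few of the outputs listed above more concrete, here is a minimal sketch of a word list, an n-gram frequency list and a keyword-in-context concordance computed over a toy sentence. It only illustrates the general ideas; it is not Sketch Engine's implementation, and the sentence, the whitespace tokenization and the function names are assumptions made for this example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def kwic(tokens, keyword, window=3):
    """Return (left context, hit, right context) for each match of keyword."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines


# Toy input, split on whitespace (a real corpus needs proper tokenization).
text = "the web corpus is a large corpus built from the web"
tokens = text.split()

word_list = Counter(tokens).most_common()        # words ordered by frequency
bigram_counts = Counter(ngrams(tokens, 2))       # frequency list of 2-grams
concordance = kwic(tokens, "corpus", window=2)   # examples in context

for left, hit, right in concordance:
    print(f"{left:>12} [{hit}] {right}")
```

Word sketches, the thesaurus and keyword extraction build on grammatical relations and statistical comparison between corpora, so they are not covered by a sketch this small.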

BARONI, Marco, et al. The WaCky wide web: a collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation, 2009, 43.3: 209–226.

FERRARESI, Adriano, et al. Introducing and evaluating ukWaC, a very large web-derived corpus of English [crawled from Webarchive]. In Proceedings of the 4th Web as Corpus Workshop (WAC-4) Can we beat Google? 2008, pp. 47–54.

CIARAMITA, Massimiliano; ALTUN, Yasemin. Broad-coverage sense disambiguation and information extraction with a supersense sequence tagger. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2006, pp. 594–602.

Search the British Web corpus

Sketch Engine offers a range of tools to work with this ukWaC – British Web corpus containing mainly British English.
