ukWaC – British English corpus from the .uk domain
The ukWaC is a text corpus of British English collected from the .uk domain, using medium-frequency words from the British National Corpus as seed words. Given these two facts, it is fair to argue that it is a corpus of mainly British English, although other varieties are likely to be included wherever they were found on a .uk domain.
The corpus was prepared by Adriano Ferraresi, and the word sketches, which make it possible to explore the grammatical relations of words, were prepared by David Tugwell. The full preparation of the corpus is described in Introducing and evaluating ukWaC, a very large web-derived corpus of English (LREC conference, 2008; crawled from Webarchive).
The corpus was part-of-speech tagged and lemmatized using TreeTagger, a leading part-of-speech tagger that has been trained for a number of languages. The tagging uses the Penn Treebank tagset.
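As an illustration of what tagged and lemmatized text makes possible, the following is a minimal sketch that reads tokens in a three-column, tab-separated layout (word, tag, lemma) and counts noun lemmas. The layout, the sample sentence, and the function name `parse_vertical` are assumptions made for this example; they are not taken from the ukWaC distribution itself, and the tags are simply Penn Treebank style labels.

```python
from collections import Counter

# Hypothetical sample: one token per line as word<TAB>tag<TAB>lemma.
# The sentence and its tags are made up for illustration.
SAMPLE = "The\tDT\tthe\ncats\tNNS\tcat\nwere\tVBD\tbe\nsleeping\tVBG\tsleep\n"


def parse_vertical(text):
    """Yield (word, tag, lemma) tuples from tab-separated lines."""
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # skip blank lines or any line that is not a token row
        yield tuple(parts)


tokens = list(parse_vertical(SAMPLE))

# Count lemmas of tokens whose tag starts with "NN" (Penn Treebank noun tags).
noun_lemmas = Counter(lemma for _, tag, lemma in tokens if tag.startswith("NN"))
print(noun_lemmas)
```

Counting lemmas rather than word forms groups "cat" and "cats" together, which is the main practical benefit of lemmatization for frequency work.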
A complete set of tools is available for working with the ukWaC corpus to generate:
- word sketch – English collocations categorized by grammatical relations
- thesaurus – synonyms and similar words for every word
- keywords – terminology extraction of one-word and multi-word units
- word lists – lists of English nouns, verbs, adjectives etc. organized by frequency
- n-grams – frequency list of multi-word units
- concordance – examples in context
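To make two of the outputs listed above more concrete, here is a minimal sketch of an n-gram frequency list and a keyword-in-context concordance over a toy token list. It is not how Sketch Engine implements these tools; the function names `ngrams` and `concordance`, the `window` parameter, and the sample text are all assumptions made for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-token sequences in the token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def concordance(tokens, keyword, window=3):
    """Return (left context, hit, right context) for each keyword match."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits


# Toy text, split on whitespace; a real corpus would be properly tokenized.
text = "the cat sat on the mat and the cat slept"
tokens = text.split()

bigrams = ngrams(tokens, 2)
print(bigrams.most_common(1))

for left, hit, right in concordance(tokens, "cat", window=2):
    print(f"{left:>10} [{hit}] {right}")
```

On a corpus the size of ukWaC, in-memory lists like these would not scale; the sketch only shows what the two outputs contain.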
BARONI, Marco, et al. The WaCky wide web: a collection of very large linguistically processed web-crawled corpora. Language resources and evaluation, 2009, 43.3: 209-226.
FERRARESI, Adriano, et al. Introducing and evaluating ukWaC, a very large web-derived corpus of English [crawled from Webarchive]. In Proceedings of the 4th Web as Corpus Workshop (WAC-4) Can we beat Google? 2008, pp. 47–54.
CIARAMITA, Massimiliano; ALTUN, Yasemin. Broad-coverage sense disambiguation and information extraction with a supersense sequence tagger. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2006, pp. 594–602.