pukWaC: ukWaC English corpus parsed with MaltParser

pukWaC is a subset of ukWaC, a British English corpus collected from the .uk domain using medium-frequency words from the British National Corpus as seed words. In addition to the content of ukWaC, pukWaC contains syntactic dependency annotation, which shows the dependencies between the units of a sentence, i.e. which word depends on which. The parsing was performed with MaltParser.
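To make the idea of dependency annotation concrete, the following minimal sketch represents a short sentence as a list of tokens, each storing the position of its head and a relation label. The tab-separated column layout, the sample sentence, the ROOT label and the function names are assumptions made for illustration only and may not match the actual pukWaC file format.

```python
from typing import NamedTuple


class Token(NamedTuple):
    index: int   # 1-based position of the token in the sentence
    word: str
    head: int    # index of the governing token; 0 marks the root
    deprel: str  # dependency label, e.g. SBJ, OBJ, NMOD


# Hypothetical layout: index, word, head index, label (tab-separated).
SAMPLE = (
    "1\tThe\t2\tNMOD\n"
    "2\tdog\t3\tSBJ\n"
    "3\tchased\t0\tROOT\n"
    "4\tcats\t3\tOBJ\n"
    "5\t.\t3\tP"
)


def parse_sentence(text):
    """Turn the tab-separated lines of one sentence into Token records."""
    tokens = []
    for line in text.splitlines():
        idx, word, head, deprel = line.split("\t")
        tokens.append(Token(int(idx), word, int(head), deprel))
    return tokens


def dependencies(tokens):
    """Return (dependent, label, head word) triples for one sentence."""
    by_index = {t.index: t.word for t in tokens}
    return [(t.word, t.deprel, by_index.get(t.head, "<root>")) for t in tokens]


for dependent, label, head in dependencies(parse_sentence(SAMPLE)):
    print(f"{dependent} --{label}--> {head}")
```

Each printed line reads as "this word depends on that word through this relation", which is the information the dependency annotation adds on top of the plain text.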

Syntactic classes

ADV Unclassified adverbial
BNF Benefactor (the for phrase for verbs that undergo dative shift)
DIR Direction
DTV Dative (the to phrase for verbs that undergo dative shift)
EXT Extent
LGS Logical subject
LOC Location
MNR Manner
PRD Predicative complement
PRP Purpose or reason
PUT Various locative complements of the verb put
SBJ Subject
TMP Temporal
VOC Vocative
AMOD Modifier of adjective or adverb
CONJ Between conjunction and second conjunct in a coordination
COORD Coordination
DEP Unclassified relation
EXTR Extraposed element in expletive constructions
GAP Gapping: between conjunction and the parts of a structure with an ellipsed head
IM Between infinitive marker and verb
NMOD Modifier of nominal
OBJ Direct or indirect object or clause complement
OPRD Object complement
P Punctuation
PMOD Between preposition and its child in a PP
PRN Parenthetical
PRT Particle
SUB Between subordinating conjunction and verb
VC Verb chain

Source: Johansson, Richard. “Dependency syntax in the CoNLL Shared Task 2008.”
See also: Building sketches from parsed corpora

Part-of-speech tagset

The pukWaC corpus was tagged by TreeTagger using the Penn Treebank tagset.
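As a rough illustration of how Penn Treebank tags can be grouped into broad word classes, the sketch below maps a tag to a coarse class by its first two characters. The grouping and the function name are assumptions for illustration; they cover only the standard Penn Treebank noun, verb, adjective and adverb tags and are not part of the corpus itself.

```python
# Coarse word classes keyed by the first two characters of a Penn Treebank tag.
# NN, NNS, NNP, NNPS -> noun; VB, VBD, VBG, VBN, VBP, VBZ -> verb;
# JJ, JJR, JJS -> adjective; RB, RBR, RBS -> adverb.
COARSE = {"NN": "noun", "VB": "verb", "JJ": "adjective", "RB": "adverb"}


def coarse_class(tag):
    """Return a coarse word class for a Penn Treebank tag (assumed grouping)."""
    return COARSE.get(tag[:2], "other")


for tag in ["NNS", "VBD", "JJR", "RB", "DT"]:
    print(tag, coarse_class(tag))
```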

Tools to work with the pukWaC corpus

A complete set of tools is available for working with this English corpus. They generate:

  • word sketch – English collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • word lists – lists of English nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
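To illustrate what an n-gram frequency list is, the following minimal sketch counts multi-word units in a tokenized text. It is a generic illustration using a made-up sentence and a hypothetical function name, not a description of how the tools listed above compute their results.

```python
from collections import Counter


def ngram_frequencies(tokens, n):
    """Count every run of n consecutive tokens in the list."""
    # zip over n shifted copies of the token list yields the n-grams.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


# Toy input; a real corpus would supply the tokens.
tokens = "the cat sat on the mat and the cat slept".split()

for gram, count in ngram_frequencies(tokens, 2).most_common(3):
    print(count, gram)
```

Sorting the counts in descending order, as `most_common` does, gives the frequency list; changing `n` switches between bigrams, trigrams and longer units.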


