pukWaC: ukWaC English corpus parsed with MaltParser

The pukWaC is a 40-million-word subset of the British English corpus ukWaC, which was collected from the .uk domain using medium-frequency words from the British National Corpus as seed words. On top of the ukWaC data, the pukWaC corpus contains syntactic dependency annotation, which shows the dependencies between units within a sentence, i.e. which word depends on which. The parsing was performed with MaltParser.

The following syntactic classes are used in the word sketch feature to name particular relations. For example, the word sketch for the noun “name” gives 7 different relations; one of them is NMOD, which marks modifiers of “name”, e.g. domain name, your name.

The syntactic classes can be displayed via View options in the Concordance tool by checking the “dependency relation” attribute.

Syntactic classes (word sketch relations)

ADV Unclassified adverbial
AMOD Modifier of adjective or adverb
CC Conjunction
COORD Coordination
DEP Unclassified relation
EXP Expletive (a word in a sentence that is not needed to express the basic meaning of the sentence)
IOBJ Indirect object
LGS Logical subject
NMOD Modifier of nominal
OBJ Direct object or clause complement
PMOD Between preposition and its child in a PP
P Punctuation
PRD Predicative complement
PRN Parenthetical
PRT Particle
SBJ Subject
VC Verb chain
VMOD Modifier of verb

Example of a sentence from the vertical file

Columns: token, lempos, tag, index, dependent position, dependency relation
Vince Vince-n NP 1 21 DEP
Hilaire Hilaire-n NP 2 15 VMOD
, ,-x , 3 15 P
one one-x CD 4 15 SBJ
of of-i IN 5 4 NMOD
the the-x DT 6 10 NMOD
first first-j JJ 7 10 NMOD
established established-j JJ 8 10 NMOD
black black-j JJ 9 10 NMOD
players player-n NNS 10 5 PMOD
in in-i IN 11 10 ADV
English English-j JJ 12 13 NMOD
football football-n NN 13 11 PMOD
, ,-x , 14 4 P
has have-v VHZ 15 0 ROOT
seen see-v VVN 16 15 VC
it it-d PP 17 19 NMOD
several several-j JJ 18 19 NMOD
times time-n NNS 19 16 OBJ
over over-i IN 20 16 ADV
. .-x SENT 21 0 ROOT
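
The six-column layout above can be read programmatically. The following is a minimal sketch, not an official Sketch Engine tool: it assumes each row holds exactly six whitespace-separated fields (real vertical files are typically tab-separated) and, judging from the rows above, that the fifth column holds the index of the token each word depends on. The names `Token`, `parse_sentence` and `dependents_by_head` are hypothetical helpers introduced for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Token(NamedTuple):
    word: str
    lempos: str
    tag: str
    index: int    # position of the token in the sentence
    head: int     # assumed: position of the token this one depends on
    deprel: str   # dependency relation label, e.g. NMOD


# A few rows copied from the example sentence above.
SAMPLE = """\
the the-x DT 6 10 NMOD
first first-j JJ 7 10 NMOD
established established-j JJ 8 10 NMOD
black black-j JJ 9 10 NMOD
players player-n NNS 10 5 PMOD
"""


def parse_sentence(text):
    """Turn vertical-format rows into Token records, skipping malformed rows."""
    tokens = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 6:
            continue  # assumption: anything else is markup or a blank line
        word, lempos, tag, index, head, deprel = fields
        tokens.append(Token(word, lempos, tag, int(index), int(head), deprel))
    return tokens


def dependents_by_head(tokens):
    """Group tokens by the position of the token they depend on."""
    groups = defaultdict(list)
    for tok in tokens:
        groups[tok.head].append(tok)
    return dict(groups)


tokens = parse_sentence(SAMPLE)
groups = dependents_by_head(tokens)

# Words attached to token 10 ("players") with the NMOD relation.
nmod_of_players = [t.word for t in groups[10] if t.deprel == "NMOD"]
print(nmod_of_players)
```

Grouping by head position mirrors how a word sketch lists, for example, the NMOD modifiers of a noun; here it collects the modifiers of “players”.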

Source: Johansson, Richard. “Dependency syntax in the CoNLL Shared Task 2008.”

See also: Building sketches from parsed corpora

Part-of-speech tagset

The pukWaC corpus was tagged by TreeTagger using the Penn Treebank tagset.


