The input format is “vertical” or “word-per-line (WPL)” text, as defined at the University of Stuttgart in the 1990s. Each line contains exactly one token: a word, number or punctuation mark. The file is plain text in a selected character encoding, without any formatting.

For example, the sentence

Suddenly, however, their posture changed.

is, in vertical text:

Suddenly
,
however
,
their
posture
changed
.

If the input text is part-of-speech-tagged and lemmatised, then we provide two additional tab-separated columns, for tag and lemma, as here (showing tags from the Penn tagset):

Suddenly	RB	suddenly
,	,	,
however	RB	however
,	,	,
their	PP$	their
posture	NN	posture
changed	VVD	change
.	SENT	.
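
The three-column layout above can be read with a few lines of code. The following is a minimal sketch, not part of any official tooling: the names `Token` and `parse_vertical` are hypothetical, and it assumes every token line has exactly three tab-separated columns and that any line enclosed in angle brackets is markup rather than a token.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    word: str
    tag: str
    lemma: str


def parse_vertical(text: str) -> List[Token]:
    """Parse three-column vertical text (word, tag, lemma) into tokens."""
    tokens = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        stripped = line.strip()
        if stripped.startswith("<") and stripped.endswith(">"):
            continue  # markup line such as <doc ...> or <g/>; ignored in this sketch
        # Assumption: exactly three tab-separated columns per token line.
        word, tag, lemma = (col.strip() for col in line.split("\t"))
        tokens.append(Token(word, tag, lemma))
    return tokens


sample = "their\tPP$\ttheir\nposture\tNN\tposture\nchanged\tVVD\tchange\n.\tSENT\t.\n"
for token in parse_vertical(sample):
    print(token.word, token.tag, token.lemma)
```

Stripping each column tolerates stray trailing spaces, which are easy to introduce when vertical files are edited by hand.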

The “glue” tag <g/> is used to specify that there should be no space between two tokens, as between a word and the following punctuation mark (in Latin and other Western scripts).
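
To illustrate the effect of the glue tag, the sketch below rebuilds running text from vertical lines. It is an illustration only: the function name `detokenize` is hypothetical, and it assumes that <g/> sits on a line of its own between the two tokens it joins and that the word form is the first tab-separated column.

```python
GLUE = "<g/>"


def detokenize(lines):
    """Join word forms with spaces, except where a glue tag intervenes."""
    parts = []
    glue_pending = False
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line == GLUE:
            glue_pending = True  # suppress the space before the next token
            continue
        word = line.split("\t")[0]  # first column is the word form
        if parts and not glue_pending:
            parts.append(" ")
        parts.append(word)
        glue_pending = False
    return "".join(parts)


vertical = [
    "Suddenly", "<g/>", ",", "however", "<g/>", ",",
    "their", "posture", "changed", "<g/>", ".",
]
print(detokenize(vertical))  # Suddenly, however, their posture changed.
```

Without the three glue lines, the same tokens would come out as "Suddenly , however , their posture changed ."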

Sometimes an attribute has multiple or disjunctive values, for example, if the POS-tagger was undecided between classifying a word as a noun (NN) or a lexical verb (VV), or if a word is associated with two grammatical relations. This can be encoded using a separator character, as specified in the Corpus Configuration File: Overview file (attributes MULTIVALUE and MULTISEP). Here the separator is “;”:

brush	NN;VV	brush
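
A script that reads such a file has to split the multi-valued column itself. The sketch below shows one way to do so, assuming “;” as the separator; the helper name `split_multivalue` is hypothetical, and the fallback for a value that consists only of the separator (such as a semicolon token) is a choice made for this example, not documented behaviour.

```python
from typing import List


def split_multivalue(value: str, sep: str = ";") -> List[str]:
    """Split a multi-valued attribute into its individual values."""
    parts = [part for part in value.split(sep) if part]
    # Fallback (an assumption of this sketch): a value made up only of the
    # separator, e.g. the tag of a ";" token, is kept as a single value.
    return parts or [value]


line = "brush\tNN;VV\tbrush"
word, tag, lemma = line.split("\t")
print(word, split_multivalue(tag), lemma)
```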

XML tags are used for structural annotation, including document, sentence or paragraph boundaries, headlines etc., and can have associated attribute-value pairs. For example:

<doc id="G10" n="32"> 
<head type="min"> 
<p n="1"> 
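
When processing a vertical file, structure lines need to be told apart from token lines and their attribute-value pairs extracted. The following is a rough sketch using regular expressions; the function name `parse_structure_tag` is hypothetical, and it assumes each tag occupies a whole line and that attribute values are enclosed in double quotes, as in the example above.

```python
import re

# Whole-line tag: optional "/", a name, zero or more key="value" pairs, optional "/".
TAG_RE = re.compile(r'^<(/?)([A-Za-z_][\w.-]*)((?:\s+[\w.-]+="[^"]*")*)\s*(/?)>$')
ATTR_RE = re.compile(r'([\w.-]+)="([^"]*)"')


def parse_structure_tag(line):
    """Return (name, attributes, kind) for a structure line, or None for a token line."""
    match = TAG_RE.match(line.strip())
    if match is None:
        return None
    closing, name, attr_text, empty = match.groups()
    kind = "close" if closing else "empty" if empty else "open"
    return name, dict(ATTR_RE.findall(attr_text)), kind


for line in ['<doc id="G10" n="32">', '<head type="min">', "</p>", "<g/>", "posture\tNN\tposture"]:
    print(repr(line), "->", parse_structure_tag(line))
```

A full XML parser is deliberately avoided here, because a vertical file as a whole is not necessarily well-formed XML: token lines may contain characters such as "<" or "&" unescaped.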

There can be any number of attributes associated with words. While the ‘standard’ ones are lemma and POS-tag, the framework can also be used for stating thesaurus category, grammatical function, and a number of other varieties of markup. Sometimes this markup will be most suitably associated with a word, and sometimes with a structural unit such as a phrase, sentence or paragraph. (The choice of encoding affects which searches can easily be made.) For the special case of text type or ‘header’ information, see Text Types, Headers and Subcorpora.