If you want to create a corpus with your own part-of-speech tags and lemmas, you need to upload it to Sketch Engine in a special format called a vertical file. The same format (described below) also lets you upload corpus data processed by external tools, preserving their PoS tags or lemmatization.
The input format is “vertical” or “word-per-line (WPL)” text, as defined at the University of Stuttgart in the 1990s. Each line contains exactly one token: a word, a number or a punctuation mark. The file is plain text in a selected character encoding, without any formatting.
For example, the sentence

Suddenly, however, their posture changed.

is written in vertical text as:

Suddenly
,
however
,
their
posture
changed
.
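The conversion shown above can be sketched in a few lines of Python. This is only an illustration: the function name `to_vertical` is hypothetical, and the regular expression is a naive tokenizer chosen for this example, not the tokenizer Sketch Engine itself uses.

```python
import re


def to_vertical(text: str) -> str:
    """Return the text as vertical (word-per-line) output.

    Naive assumption: a token is either a run of word characters
    or a single punctuation character.
    """
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return "\n".join(tokens) + "\n"


if __name__ == "__main__":
    print(to_vertical("Suddenly, however, their posture changed."), end="")
```

Real corpora usually need a language-aware tokenizer (for abbreviations, clitics, URLs and so on); the point here is only the one-token-per-line layout.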
If the input text is part-of-speech-tagged and lemmatized, each token line carries two additional tab-separated columns, for tag and lemma, as in this example (tags from the Penn tagset):
Suddenly	RB	suddenly
<g/>
,	,	,
however	RR	however
<g/>
,	,	,
their	PP$	their
posture	NN	posture
changed	VVD	change
<g/>
.	SENT	.
The “glue” tag <g/> specifies that there should be no space between two tokens, as between a word and the following punctuation (in Latin and other Western scripts).
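To make the effect of the glue tag concrete, here is a minimal sketch that rebuilds running text from a vertical fragment. The function name `vertical_to_text` is hypothetical, and the sketch assumes that the word form is the first tab-separated column and that the input contains only token lines and `<g/>` lines (no other structures).

```python
def vertical_to_text(vertical: str) -> str:
    """Rebuild running text from vertical lines, honouring <g/> glue tags."""
    pieces = []
    glue = True  # no space is wanted before the very first token
    for line in vertical.splitlines():
        if not line.strip():
            continue  # ignore empty lines
        if line.strip() == "<g/>":
            glue = True  # suppress the space before the next token
            continue
        word = line.split("\t")[0]  # assumed: word form is column 1
        if not glue:
            pieces.append(" ")
        pieces.append(word)
        glue = False
    return "".join(pieces)


if __name__ == "__main__":
    sample = (
        "Suddenly\tRB\tsuddenly\n<g/>\n,\t,\t,\n"
        "however\tRR\thowever\n<g/>\n,\t,\t,\n"
        "their\tPP$\ttheir\nposture\tNN\tposture\n"
        "changed\tVVD\tchange\n<g/>\n.\tSENT\t.\n"
    )
    print(vertical_to_text(sample))
```

Without the `<g/>` lines, every token would be separated by a space, giving “Suddenly , however , their posture changed .”.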
Sometimes an attribute has multiple or disjunctive values, for example if the POS tagger could not decide between a noun (NN) and a lexical verb (VV), or if a word is associated with two grammatical relations. Such values can be encoded with a separator character specified in the corpus configuration file (attributes MULTIVALUE and MULTISEP), here “;”:
brush	NN;VV	brush
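For readers who process vertical files with their own scripts, the following sketch shows how such a line might be split. It is an illustration only: `parse_token_line` is a hypothetical helper, the column order word, tag, lemma is assumed from the example above, and the separator is passed in by hand rather than read from the corpus configuration.

```python
def parse_token_line(line: str, multisep: str = ";") -> dict:
    """Split one token line into its columns and expand a multi-valued tag."""
    # Assumed layout: word <TAB> tag <TAB> lemma
    word, tag, lemma = line.rstrip("\n").split("\t")
    return {
        "word": word,
        "tag": tag.split(multisep),  # e.g. "NN;VV" -> ["NN", "VV"]
        "lemma": lemma,
    }


if __name__ == "__main__":
    print(parse_token_line("brush\tNN;VV\tbrush"))
```

A tag without the separator simply yields a one-element list, so single-valued and multi-valued lines can be handled uniformly.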
XML tags are used for structural annotation including document, sentence or paragraph boundaries, headlines etc. and can have associated attribute-value pairs. For example:
<doc id="G10" n="32">
<head type="min">
FEDERAL
CONSTITUTION
<g/>
,
1789
</head>
<p n="1">
"
<g/>
we
the
People
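When reading such a file in a script, each line has to be recognised as a structure tag, a glue tag or a token. The sketch below shows one way to do this with regular expressions. The function name `classify` and the returned tuple layout are hypothetical, and the patterns assume attribute values are always in double quotes, as in the example above.

```python
import re

ATTR_RE = re.compile(r'(\w+)="([^"]*)"')
OPEN_RE = re.compile(r'^<(\w+)((?:\s+\w+="[^"]*")*)\s*>$')
CLOSE_RE = re.compile(r"^</(\w+)>$")


def classify(line: str):
    """Return (kind, name_or_word, attributes) for one vertical line."""
    line = line.rstrip("\n")
    if line == "<g/>":
        return ("glue", None, {})
    match = CLOSE_RE.match(line)
    if match:
        return ("close", match.group(1), {})
    match = OPEN_RE.match(line)
    if match:
        # Collect the attribute-value pairs of the opening tag.
        return ("open", match.group(1), dict(ATTR_RE.findall(match.group(2))))
    # Anything else is treated as a token; the word form is column 1.
    return ("token", line.split("\t")[0], {})


if __name__ == "__main__":
    for sample in ['<doc id="G10" n="32">', "FEDERAL", "<g/>", "</head>"]:
        print(classify(sample))
```

Because a line only counts as a structure when it matches one of the tag patterns as a whole, a token that merely begins with “<” is still returned as a token.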
There can be any number of attributes associated with words. While the ‘standard’ ones are lemma and POS tag, the framework can also be used to record thesaurus category, grammatical function, and many other kinds of markup. Sometimes this markup is most suitably associated with a word, and sometimes with a structure such as a phrase, sentence or paragraph. (The choice of encoding affects which searches can easily be made.) For the special case of text type or ‘header’ information, see Text Types, Headers and Subcorpora.
The built-in annotation tool allows adding metadata to documents easily.