If your vertical text contains only words and no annotation, the configuration can be very simple:
Example 1
PATH /corpora/test1
ATTRIBUTE word
If you omit VERTICAL, you have to specify the source file on the encodevert command line:
% encodevert -c test1 /corpora/src/test1.vertical
Adding VERTICAL to the configuration simplifies the encodevert command:
% encodevert -c test2
Select an appropriate ENCODING so that characters are displayed properly in Sketch Engine. For each attribute you can specify a LOCALE, which controls sorting and the handling of character classes in regular expressions. The default “C” locale corresponds to English. The following example uses ISO Latin 2 encoding and the Czech locale.
Example 2
PATH /corpora/test1
VERTICAL "/corpora/src/test1.vertical"
ENCODING "iso8859-2"
ATTRIBUTE word {
LOCALE "cs_CZ.ISO8859-2"
}
If your vertical text contains a POS tag for each token (word), also specify a second attribute.
Example 3
PATH /corpora/test1
VERTICAL "/corpora/src/test1.vertical"
ENCODING "iso8859-2"
ATTRIBUTE word {
LOCALE "cs_CZ.ISO8859-2"
}
ATTRIBUTE pos
If your vertical text contains sentence boundaries annotated with <s> and </s> and document boundaries annotated with <doc> and </doc>, add structure definitions.
Example 4
PATH /corpora/test2
VERTICAL "/corpora/src/test2.vertical"
ENCODING "iso8859-2"
ATTRIBUTE word
STRUCTURE doc
STRUCTURE s
If your <doc> annotation contains document meta-information, such as the author and the date of publication in the form <doc author="Lewis Carroll" date="1876">, add structure attribute definitions.
Example 5
PATH /corpora/test3
VERTICAL "/corpora/src/test3.vertical"
ENCODING "iso8859-2"
ATTRIBUTE word
STRUCTURE doc {
ATTRIBUTE author
ATTRIBUTE date
}
STRUCTURE s
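To make the link between the tag and the configuration concrete, the following Python sketch splits an opening tag line into a structure name and its attribute values, which correspond to STRUCTURE doc and its ATTRIBUTE lines in Example 5. The helper parse_structure_tag is hypothetical and is not part of Sketch Engine; the real parser may treat edge cases differently.

```python
import re

# Hypothetical helper (not Sketch Engine code): shows how a tag such as
# <doc author="..." date="..."> relates to STRUCTURE doc and its ATTRIBUTEs.
TAG_RE = re.compile(r'^<(\w+)((?:\s+\w+="[^"]*")*)\s*>$')
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_structure_tag(line):
    """Return (structure_name, {attribute: value}), or None if the line
    is not an opening structure tag."""
    match = TAG_RE.match(line.strip())
    if match is None:
        return None
    name, attr_text = match.groups()
    return name, dict(ATTR_RE.findall(attr_text))


name, attrs = parse_structure_tag('<doc author="Lewis Carroll" date="1876">')
print(name)   # structure name, to be declared with STRUCTURE
print(attrs)  # attribute names, to be declared with ATTRIBUTE inside it
```

Each key of the returned dictionary would need its own ATTRIBUTE line inside the STRUCTURE block.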
If your POS attribute contains ambiguous tags like NN1-VVB in the BNC, and you would like [pos="NN1"] queries to find such tags as well, add a multivalue configuration.
Example 6
PATH /corpora/test4
ENCODING "iso8859-2"
ATTRIBUTE word
ATTRIBUTE pos {
MULTIVALUE yes
MULTISEP "-"
}
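The effect of MULTIVALUE and MULTISEP can be pictured as splitting each stored value on the separator and matching a query against the individual components. The Python sketch below only illustrates that idea; it is not how the indexer is implemented, and matches_component is a hypothetical name. It covers component matching only and makes no claim about how the unsplit value itself is treated.

```python
def matches_component(stored_value, query_value, multisep="-"):
    """Return True if query_value equals one of the components obtained by
    splitting stored_value on the MULTISEP character (illustration only)."""
    return query_value in stored_value.split(multisep)


# An ambiguous BNC-style tag is found by a query for either component.
print(matches_component("NN1-VVB", "NN1"))  # True
print(matches_component("NN1-VVB", "VVB"))  # True
# A partial tag is not a component, so it does not match.
print(matches_component("NN1-VVB", "NN"))   # False
```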
If you would like to add a dynamic attribute, add a new attribute definition. In the following example the vertical text contains words only (one column), but the corpus has an additional attribute lc generated from the word attribute. The values of lc consist of the respective words converted to lowercase. The transformation function is an internal function named “lowercase” (its definition can be found in the stddynfun.c file). It accepts two arguments: the first is a word and the second is a locale (in this corpus “cs_CZ”). DEFAULTATTR ensures that lc is used when evaluating queries that do not name an attribute. TRANSQUERY ensures that the transformation function is applied to the query string before the query is evaluated.
Example 7
PATH /corpora/test1
VERTICAL "/corpora/src/test1.vertical"
ENCODING "iso8859-2"
DEFAULTATTR lc
ATTRIBUTE word {
LOCALE "cs_CZ"
}
ATTRIBUTE lc {
LOCALE "cs_CZ"
DYNAMIC lowercase
DYNLIB internal
FUNTYPE s
FROMATTR word
ARG1 "cs_CZ"
TRANSQUERY yes
}
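The interplay of the dynamic attribute and TRANSQUERY can be sketched in Python as follows. This is an illustration under stated assumptions, not the actual implementation: Python's str.lower() stands in for the internal “lowercase” function, the locale argument is kept only to mirror the two-argument signature described above, and the word list and the find helper are made up for the example.

```python
def lowercase(value, locale_name):
    """Stand-in for the internal 'lowercase' function (assumption).
    locale_name is accepted but unused: str.lower() is Unicode-aware."""
    return value.lower()


# Hypothetical one-column corpus and its derived lc attribute.
words = ["Praha", "PRAHA", "praha", "Brno"]
lc = [lowercase(w, "cs_CZ") for w in words]


def find(query, transquery=True):
    """Return positions whose lc value equals the query. With transquery
    enabled, the query string is transformed first, as TRANSQUERY yes does."""
    q = lowercase(query, "cs_CZ") if transquery else query
    return [i for i, value in enumerate(lc) if value == q]


print(find("Praha"))                    # [0, 1, 2]
print(find("Praha", transquery=False))  # []
```

Without the query transformation, a capitalised query can never match, because every stored lc value is already lowercase.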
The transformation function of a dynamic attribute can also be an external function. DYNLIB then gives the full path to a dynamic library. The following example lists two dynamic attributes which add a lemma and a morphological annotation to a corpus. Both transformation functions (tags and lemmata) return ambiguous values separated by a comma.
Example 8
PATH /corpora/test1
VERTICAL "/corpora/src/test1.vertical"
ENCODING "iso8859-2"
ATTRIBUTE word {
LOCALE "cs_CZ"
}
ATTRIBUTE lemma {
LOCALE "cs_CZ"
DYNAMIC lemmata
DYNLIB /corpora/bin/alibfun.so
ARG1 0
FUNTYPE i
FROMATTR word
MULTIVALUE yes
MULTISEP ","
}
ATTRIBUTE tag {
DYNAMIC tags
DYNLIB /corpora/bin/alibfun.so
FUNTYPE 0
FROMATTR word
MULTIVALUE yes
MULTISEP ","
}
Parallel corpora are handled as two separate corpora. ALIGNED gives the name of the parallel counterpart. Both corpora should have a structure named “align” with a one-to-one correspondence between the respective token sequences. The following example shows two configuration files, one for each corpus.
Example 9a (paren)
PATH /corpora/par-en
VERTICAL "/corpora/src/par-en.vertical"
ENCODING "iso8859-1"
ATTRIBUTE word
STRUCTURE doc {
ATTRIBUTE id
}
STRUCTURE s
STRUCTURE align
ALIGNED parcs
Example 9b (parcs)
PATH /corpora/par-cs
VERTICAL "/corpora/src/par-cs.vertical"
ENCODING "iso8859-2"
ATTRIBUTE word {
LOCALE "cs_CZ"
}
STRUCTURE doc {
ATTRIBUTE id
}
STRUCTURE s
STRUCTURE align
ALIGNED paren
The final example is part of a BNC configuration. It shows the use of INFO and FULLREF.
Example 10
PATH /corpora/bnc
INFO "British National Corpus"
VERTICAL /corpora/src/bnc.vert
ENCODING "iso8859-1"
DEFAULTATTR lc
FULLREF "bncdoc.id,bncdoc.author,bncdoc.title,bncdoc.date,bncdoc.info"
ATTRIBUTE word
ATTRIBUTE tag {
MULTIVALUE y
MULTISEP "-"
}
ATTRIBUTE lc {
DYNAMIC lowercase
DYNLIB internal
FUNTYPE s
ARG1 "C"
FROMATTR word
TRANSQUERY yes
}
STRUCTURE bncdoc {
ATTRIBUTE id
ATTRIBUTE date
ATTRIBUTE year {
DYNAMIC firstn
DYNLIB internal
FUNTYPE i
ARG1 4
FROMATTR date
}
ATTRIBUTE author {
MULTIVALUE y
MULTISEP ";"
}
ATTRIBUTE title
ATTRIBUTE info
ATTRIBUTE allava
ATTRIBUTE alltim
ATTRIBUTE alltyp
ATTRIBUTE wriaag
ATTRIBUTE wriad
ATTRIBUTE wriase
}
STRUCTURE stext {
ATTRIBUTE org
}
STRUCTURE text {
ATTRIBUTE org
}
STRUCTURE s {
ATTRIBUTE n
}
STRUCTURE p {
ATTRIBUTE rend
}
STRUCTURE body
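In Example 10 the year attribute of bncdoc is derived from date by the internal function firstn with ARG1 4. Assuming firstn returns the first n characters of its input (an assumption based on its name and arguments, not confirmed here), its effect can be sketched in Python as below. The date value is hypothetical and is not taken from the BNC.

```python
def firstn(value, n):
    """Assumed behaviour of the internal 'firstn' function:
    keep only the first n characters of the source value."""
    return value[:n]


# Hypothetical date value; with ARG1 4 the derived attribute keeps the year.
print(firstn("1991-03-15", 4))  # 1991
```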
Naming structures and attributes
Names of structures and attributes may contain only the characters a-z, A-Z, 0-9 and underscore. Names that do not begin with a-z must be enclosed in double quotes. The positional attributes word, tag, lempos and lemma should not be renamed. Correct examples:
ATTRIBUTE word
STRUCTURE doc {
ATTRIBUTE title1
ATTRIBUTE "Title2"
}
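The naming rules above are simple enough to check mechanically. The following Python sketch is one possible way to do so; format_name is a hypothetical helper, not a Sketch Engine tool. It rejects names with disallowed characters and adds double quotes when the first character is not a-z.

```python
import re

# Allowed characters according to the rule above: a-z, A-Z, 0-9, underscore.
NAME_RE = re.compile(r"[A-Za-z0-9_]+")


def format_name(name):
    """Return the name as it should be written in a configuration file.
    Raises ValueError if the name contains a disallowed character."""
    if NAME_RE.fullmatch(name) is None:
        raise ValueError(f"invalid structure or attribute name: {name!r}")
    if "a" <= name[0] <= "z":
        return name            # begins with a-z: no quoting needed
    return f'"{name}"'         # anything else must be double quoted


print(format_name("title1"))  # title1
print(format_name("Title2"))  # "Title2"
```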




