If your vertical text contains only words and no annotation, the configuration can be very simple:

Example 1

PATH /corpora/test1
ATTRIBUTE word

If you omit VERTICAL, you have to specify a source file for the encodevert command:

% encodevert -c test1 /corpora/src/test1.vertical

Adding VERTICAL to the configuration simplifies the encodevert command:

% encodevert -c test2

Select an appropriate ENCODING so that characters are displayed properly in Sketch Engine. For each attribute you can specify a LOCALE, which controls sorting and the handling of regular expression character classes. The default "C" locale corresponds to English. The following example uses ISO Latin 2 encoding and a Czech locale.

Example 2

PATH /corpora/test1
VERTICAL "/corpora/src/test1.vertical"
ENCODING "iso8859-2"
ATTRIBUTE word {
	LOCALE "cs_CZ.ISO8859-2"
}

If your vertical text contains a POS tag for each token (word), also specify a second attribute.

Example 3

PATH /corpora/test1
VERTICAL "/corpora/src/test1.vertical"
ENCODING "iso8859-2"
ATTRIBUTE word {
	LOCALE "cs_CZ.ISO8859-2"
}
ATTRIBUTE pos

If your vertical text contains sentence boundaries annotated with <s> and </s> and document boundaries annotated with <doc> and </doc>, add structure definitions.

Example 4

PATH /corpora/test2
VERTICAL "/corpora/src/test2.vertical"
ENCODING "iso8859-2"
ATTRIBUTE word 
STRUCTURE doc
STRUCTURE s

If your <doc> annotation contains document meta-information such as the author and the date of publication, in the form <doc author="Lewis Carroll" date="1876">, add structure attribute definitions.

Example 5

PATH /corpora/test3
VERTICAL "/corpora/src/test3.vertical"
ENCODING "iso8859-2"
ATTRIBUTE word 
STRUCTURE doc {
	ATTRIBUTE author
	ATTRIBUTE date
}
STRUCTURE s
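
To make the link between the vertical text and the ATTRIBUTE and STRUCTURE lines more concrete, here is a minimal Python sketch that reads a small vertical sample. It is only an illustration and not how encodevert works internally: the sample text, the parse_vertical function and the returned data layout are all assumptions made for this example.

```python
import re

# Hypothetical vertical sample: one token per line, structures as
# XML-like tags on their own lines (content invented for illustration).
SAMPLE = '''<doc author="Lewis Carroll" date="1876">
<s>
The
Snark
</s>
</doc>
'''

TAG_RE = re.compile(r'<(/?)(\w+)([^>]*)>$')
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_vertical(text):
    """Return (tokens, structures) read from a vertical-text string.

    tokens     -- list of column lists, one per token line
    structures -- list of (name, attrs, start, end), end exclusive
    """
    tokens = []
    structures = []
    open_tags = []  # stack of (name, attrs, start position)
    for line in text.splitlines():
        m = TAG_RE.match(line)
        if m is None:
            if line:
                # Columns are tab-separated: word, then e.g. pos.
                tokens.append(line.split('\t'))
            continue
        closing, name, rest = m.groups()
        if closing:
            open_name, attrs, start = open_tags.pop()
            assert open_name == name, "mismatched structure tags"
            structures.append((name, attrs, start, len(tokens)))
        else:
            attrs = dict(ATTR_RE.findall(rest))
            open_tags.append((name, attrs, len(tokens)))
    return tokens, structures
```

In this sketch each column of a token line corresponds to one positional ATTRIBUTE, each tag name to a STRUCTURE, and each key inside a tag to an ATTRIBUTE nested in that STRUCTURE.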

If your POS attribute contains ambiguous tags like NN1-VVB in the BNC, and you would like [pos="NN1"] queries to find such tags, add a multivalue configuration.

Example 6

PATH /corpora/test4
ENCODING "iso8859-2"
ATTRIBUTE word 
ATTRIBUTE pos {
	MULTIVALUE yes
	MULTISEP "-"
}
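
The effect of MULTIVALUE and MULTISEP can be pictured with a short Python sketch. It is a conceptual model only, assuming that a stored value is split on the separator and that a query value matches if it equals one of the parts; the function name matches_multivalue is hypothetical and the real index lookup is not shown.

```python
def matches_multivalue(stored, wanted, sep="-"):
    """True if `wanted` is one of the components of `stored`.

    `sep` plays the role of MULTISEP in this sketch.
    """
    return wanted in stored.split(sep)


# The ambiguous tag is found by a query for either of its parts.
print(matches_multivalue("NN1-VVB", "NN1"))
print(matches_multivalue("NN1-VVB", "VVB"))
```

Whether the complete string NN1-VVB is also searchable as a value of its own is not covered by this sketch.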

If you would like to add a dynamic attribute, add a new attribute definition. In the following example the vertical text contains words only (one column), but the corpus has an additional attribute lc generated from the word attribute. The values of lc are the respective words converted to lowercase. The transformation function is an internal function named "lowercase" (its definition is in the stddynfun.c file). It accepts two arguments: the first is a word and the second is a locale (in this corpus "cs_CZ"). DEFAULTATTR ensures that lc is used when evaluating queries that do not name an attribute. TRANSQUERY ensures that the transformation function is applied to the query string before the query is evaluated.

Example 7

PATH /corpora/test1
VERTICAL "/corpora/src/test1.vertical"
ENCODING "iso8859-2"
DEFAULTATTR lc
ATTRIBUTE word {
	LOCALE "cs_CZ"
}
ATTRIBUTE   lc {
	LOCALE "cs_CZ"

	DYNAMIC    lowercase
	DYNLIB     internal
	FUNTYPE    s
	FROMATTR   word
	ARG1       "cs_CZ"
	TRANSQUERY yes
}
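
The interplay of a dynamic attribute and TRANSQUERY can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the internal implementation: the word list is invented, the lowercase function here ignores its locale argument, and find is a hypothetical helper standing in for query evaluation on the lc attribute.

```python
def lowercase(value, locale_name):
    # Mirrors the two-argument shape (word, locale) described above.
    # The locale is ignored here; Python's str.lower() is used instead.
    return value.lower()


# Hypothetical values of the word attribute.
words = ["Praha", "je", "PRAHA"]

# The dynamic attribute lc is derived from word (FROMATTR word).
lc = [lowercase(w, "cs_CZ") for w in words]


def find(query, transquery=True):
    """Return positions whose lc value equals the query string."""
    # With TRANSQUERY the same function is applied to the query first.
    needle = lowercase(query, "cs_CZ") if transquery else query
    return [i for i, value in enumerate(lc) if value == needle]
```

With transquery enabled, a query for "Praha" is lowercased and finds both spellings; without it, the capitalised query string never equals any lc value.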

The transformation function of a dynamic attribute can also be an external function. DYNLIB then gives the full path to the dynamic library. The following example lists two dynamic attributes which add a lemma and a morphological annotation to a corpus. Both transformation functions (tags and lemmata) return ambiguous values separated by commas.

Example 8

PATH /corpora/test1
VERTICAL "/corpora/src/test1.vertical"
ENCODING "iso8859-2"

ATTRIBUTE   word {
	LOCALE "cs_CZ"
}
ATTRIBUTE   lemma {
	LOCALE "cs_CZ"
	DYNAMIC    lemmata
	DYNLIB     /corpora/bin/alibfun.so
	ARG1       0
	FUNTYPE    i
	FROMATTR   word

	MULTIVALUE yes
	MULTISEP   ","
}
ATTRIBUTE   tag {
	DYNAMIC    tags
	DYNLIB     /corpora/bin/alibfun.so
	FUNTYPE    0
	FROMATTR   word

	MULTIVALUE yes
	MULTISEP   ","
}

Parallel corpora are handled as two separate corpora. ALIGNED gives the name of the parallel counterpart. Both corpora should have a structure named "align" with a one-to-one correspondence between the respective token sequences. The following example shows two configuration files, one for each corpus.

Example 9a (paren)

PATH /corpora/par-en
VERTICAL "/corpora/src/par-en.vertical"
ENCODING "iso8859-1"
ATTRIBUTE word 
STRUCTURE doc {
	ATTRIBUTE id
}
STRUCTURE s
STRUCTURE align
ALIGNED	  parcs

Example 9b (parcs)

PATH /corpora/par-cs
VERTICAL "/corpora/src/par-cs.vertical"
ENCODING "iso8859-2"
ATTRIBUTE word {
	LOCALE "cs_CZ"
}
STRUCTURE doc {
	ATTRIBUTE id
}
STRUCTURE s
STRUCTURE align
ALIGNED	  paren
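
The one-to-one correspondence of "align" structures can be illustrated with a small Python sketch. It assumes that the n-th align structure in one corpus corresponds to the n-th align structure in the other; the token ranges and the pair_alignments helper are hypothetical and only serve to show the idea.

```python
def pair_alignments(align_a, align_b):
    """Pair the n-th align range of one corpus with the n-th of the other.

    Each range is a (start, end) token span, end exclusive.
    """
    if len(align_a) != len(align_b):
        raise ValueError("both corpora need the same number of align structures")
    return list(zip(align_a, align_b))


# Hypothetical align ranges in the two corpora.
paren_align = [(0, 3), (3, 7)]
parcs_align = [(0, 2), (2, 6)]

pairs = pair_alignments(paren_align, parcs_align)
```

The ranges may differ in length between the languages; only the number and order of the align structures has to agree in this model.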

The final example is part of a BNC configuration. It shows the use of INFO and FULLREF.

Example 10

PATH   /corpora/bnc
INFO   "British National Corpus"
VERTICAL /corpora/src/bnc.vert
ENCODING "iso8859-1"

DEFAULTATTR lc

FULLREF "bncdoc.id,bncdoc.author,bncdoc.title,bncdoc.date,bncdoc.info"

ATTRIBUTE   word
ATTRIBUTE   tag {
	MULTIVALUE y
	MULTISEP   "-"
}

ATTRIBUTE   lc {
	DYNAMIC lowercase
	DYNLIB  internal
	FUNTYPE s
	ARG1    "C"
	FROMATTR word
	TRANSQUERY	yes
}
	
STRUCTURE   bncdoc {
	ATTRIBUTE id
	ATTRIBUTE date
	ATTRIBUTE year {
		DYNAMIC firstn
		DYNLIB  internal
		FUNTYPE i
		ARG1    4
		FROMATTR date
	}
	ATTRIBUTE author {
		MULTIVALUE y
		MULTISEP   ";"
	}
	ATTRIBUTE title
	ATTRIBUTE info

	ATTRIBUTE allava
	ATTRIBUTE alltim
	ATTRIBUTE alltyp

	ATTRIBUTE wriaag
	ATTRIBUTE wriad
	ATTRIBUTE wriase
}

STRUCTURE   stext {
	ATTRIBUTE org
}
STRUCTURE   text {
	ATTRIBUTE org
}

STRUCTURE   s {
	ATTRIBUTE n
}

STRUCTURE   p {
	ATTRIBUTE rend
}
STRUCTURE   body 
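
The year attribute above is derived from date with the internal function firstn and ARG1 set to 4. A rough Python equivalent, assuming that firstn keeps the first ARG1 characters of the source value, could look as follows. The date string is a made-up example, and behaviour with multi-byte encodings is not modelled.

```python
def firstn(value, n):
    """Return the first n characters of value (assumed firstn behaviour)."""
    return value[:n]


# Hypothetical date value of a bncdoc structure.
year = firstn("1991-05-17", 4)
print(year)
```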

Naming structures and attributes

Names of structures and attributes may contain only the characters a-z, A-Z, 0-9 and underscore. Names that do not begin with a-z must be double-quoted. The positional attributes word, tag, lempos and lemma should not be renamed. Correct examples:

ATTRIBUTE word
STRUCTURE doc {
	ATTRIBUTE title1
	ATTRIBUTE "Title2"
}
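
The naming rules can be checked mechanically. The following Python sketch is one possible reading of them: format_name is a hypothetical helper that rejects names with disallowed characters and adds double quotes when a name does not begin with a-z.

```python
import re

# Only a-z, A-Z, 0-9 and underscore are allowed in names.
NAME_RE = re.compile(r'[A-Za-z0-9_]+')


def format_name(name):
    """Return the name as it should be written in a configuration file."""
    if not NAME_RE.fullmatch(name):
        raise ValueError("invalid character in name: %r" % name)
    if 'a' <= name[0] <= 'z':
        return name
    # Names not beginning with a-z must be double-quoted.
    return '"' + name + '"'
```

For the examples above, title1 is returned unchanged and Title2 is returned in double quotes.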
