This page describes how to prepare a text corpus for indexing by Manatee, the corpus management system that Sketch Engine uses as its underlying database backend.

Text corpus from a technical point of view

Informally, a text corpus is usually defined as something close to “any collection of texts in electronic form”. More formally, a corpus source text consists of:

    • positions, i.e. individual occurrences of tokens in the texts, where each position has associated attributes such as word, lemma or tag
    • structures, i.e. corpus segments (ranges) that span part of the corpus and are defined by their beginning and ending positions; they usually denote documents, paragraphs or sentences
    • structure attributes, i.e. attributes of individual structures that hold their metadata, such as date of creation or author

Structures and structure attributes are sometimes referred to as headers or corpus metadata.

The example below illustrates the notions defined above on a sample vertical text:

DESCRIPTION                                      CORPUS VERTICAL TEXT

Start of structure "doc"
with 2 structure attributes "author" and "year": <doc author="Shakespeare" year="1603">
Start of structure "p" for a paragraph:          <p>
Start of structure "s" for a sentence:           <s>
Position #0 -- all positions have 3 attributes
separated by a tab character.                    To        to        PREPOSITION
Position #1                                      be        be        VERB
Empty structure "g" denoting a "glue" 
(no space separation) between tokens:            <g/>
Position #2                                      ,         ,         PUNCTUATION
Position #3                                      or        or        CONJUNCTION
Position #4                                      not       not       PARTICLE
Position #5                                      to        to        PREPOSITION
Position #6                                      be        be        VERB
Empty structure "g"                              <g/>
Position #7                                      ,         ,         PUNCTUATION
Position #8                                      that      that      PRONOUN
Position #9                                      is        be        VERB
Position #10                                     the       the       DETERMINER
Position #11                                     question  question  NOUN
Empty structure "g"                              <g/>
Position #12                                     .         .         PUNCTUATION
End of the last structure "s"                    </s>
End of the last structure "p"                    </p>
End of the last structure "doc"                  </doc>
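
To make the three notions concrete, here is a minimal Python sketch that reads the sample vertical text and separates it into positions, structures and structure attributes. It is an illustration only and not part of Manatee: the names `parse_vertical`, `Structure` and `SAMPLE` are hypothetical, the attribute names word/lemma/tag are assumed from the example, and structure ranges are stored as half-open intervals (`end` is one past the last covered position), which is a choice made here for convenience.

```python
import re
from dataclasses import dataclass, field

# A line counts as a structure tag if it looks like <name ...>, </name> or <name/>.
TAG_RE = re.compile(r'^<(/?)([A-Za-z_][\w-]*)((?:\s+[\w-]+="[^"]*")*)\s*(/?)>$')
ATTR_RE = re.compile(r'([\w-]+)="([^"]*)"')


@dataclass
class Structure:
    name: str
    begin: int                      # first position covered by the structure
    end: int                        # one past the last covered position
    attrs: dict = field(default_factory=dict)


def parse_vertical(text, attr_names=("word", "lemma", "tag")):
    """Split a vertical text into a list of positions and a list of structures."""
    positions = []
    structures = []
    open_structs = {}               # structure name -> currently open structures
    for line in text.splitlines():
        if not line:
            continue
        match = TAG_RE.match(line)
        if match is None:
            # Token line: one tab-separated value per positional attribute.
            positions.append(dict(zip(attr_names, line.split("\t"))))
            continue
        closing, name, raw_attrs, empty = match.groups()
        here = len(positions)
        if empty:
            # Empty structure such as <g/>: it covers no positions.
            structures.append(Structure(name, here, here))
        elif closing:
            pending = open_structs.get(name)
            if not pending:
                raise ValueError(f"</{name}> has no matching opening tag")
            struct = pending.pop()
            struct.end = here
            structures.append(struct)
        else:
            attrs = dict(ATTR_RE.findall(raw_attrs))
            open_structs.setdefault(name, []).append(Structure(name, here, here, attrs))
    return positions, structures


SAMPLE = "\n".join([
    '<doc author="Shakespeare" year="1603">',
    "<p>",
    "<s>",
    "To\tto\tPREPOSITION",
    "be\tbe\tVERB",
    "<g/>",
    ",\t,\tPUNCTUATION",
    "or\tor\tCONJUNCTION",
    "not\tnot\tPARTICLE",
    "to\tto\tPREPOSITION",
    "be\tbe\tVERB",
    "<g/>",
    ",\t,\tPUNCTUATION",
    "that\tthat\tPRONOUN",
    "is\tbe\tVERB",
    "the\tthe\tDETERMINER",
    "question\tquestion\tNOUN",
    "<g/>",
    ".\t.\tPUNCTUATION",
    "</s>",
    "</p>",
    "</doc>",
])

if __name__ == "__main__":
    positions, structures = parse_vertical(SAMPLE)
    print(len(positions), "positions")
    for struct in structures:
        print(struct)
```

Run on the sample, the sketch yields 13 positions (#0 to #12), one "doc" structure spanning all of them with the attributes author and year, and three empty "g" structures located immediately before the punctuation tokens.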

Steps to prepare a text corpus for Sketch Engine

  1. Prepare the source data, including both
  2. Prepare the corpus configuration file
  3. (optionally) Prepare the subcorpus configuration file
    This step is needed if you wish to compile subcorpora that can be shared by multiple users.
  4. (optionally) Prepare or reuse a word sketch definition file
    This step is needed if you require word sketches or a thesaurus (the thesaurus takes the word sketch database as input).
  5. Compile (index) the corpus
  6. Verify corpus consistency, integrity and completeness
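
Before compiling, it can help to sanity-check the source data prepared in step 1. The Python sketch below is a hypothetical pre-compilation check, not a Manatee tool: `check_vertical` is an illustrative name, and the two rules it applies (every token line has the same number of tab-separated attributes, and every opening structure tag has a matching closing tag of the same name) are assumptions about what a well-formed vertical file looks like, based on the sample above.

```python
import re

# A line counts as a structure tag if it looks like <name ...>, </name> or <name/>.
TAG_RE = re.compile(r'^<(/?)([A-Za-z_][\w-]*)(?:\s+[\w-]+="[^"]*")*\s*(/?)>$')


def check_vertical(lines, n_attrs):
    """Return a list of (line_number, message) problems found in a vertical text."""
    problems = []
    open_tags = {}                  # structure name -> line numbers of open tags
    for lineno, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line:
            continue
        match = TAG_RE.match(line)
        if match is None:
            # Token line: check the number of positional attributes.
            found = len(line.split("\t"))
            if found != n_attrs:
                problems.append((lineno, f"expected {n_attrs} columns, found {found}"))
            continue
        closing, name, empty = match.groups()
        if empty:
            continue                # <g/> and similar need no closing tag
        if closing:
            if open_tags.get(name):
                open_tags[name].pop()
            else:
                problems.append((lineno, f"</{name}> has no matching opening tag"))
        else:
            open_tags.setdefault(name, []).append(lineno)
    # Anything still open at the end of the input was never closed.
    for name, linenos in open_tags.items():
        for lineno in linenos:
            problems.append((lineno, f"<{name}> is never closed"))
    return problems


if __name__ == "__main__":
    import sys
    # Usage (hypothetical): python check_vertical.py corpus.vert 3
    with open(sys.argv[1], encoding="utf-8") as handle:
        for lineno, message in check_vertical(handle, int(sys.argv[2])):
            print(f"line {lineno}: {message}")
```

The check deliberately tracks open tags per structure name rather than enforcing strict nesting across different structures, so it makes no claim about whether overlapping structures are acceptable to the indexer.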