• salience [ statistics ]

    a statistical measure of the significance of a specific token in the given context. Salience is measured with logDice; for more information, see section 3 of Statistics used in Sketch Engine.
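
    As a rough illustration of how a logDice score is computed, the sketch below assumes the commonly published form of the measure, 14 + log2(2 * f_xy / (f_x + f_y)), where f_xy is the co-occurrence frequency of the two items and f_x and f_y are their individual frequencies. The function name and arguments are hypothetical, not part of any Sketch Engine API; the referenced statistics document remains the authoritative definition.

```python
import math


def log_dice(f_xy: int, f_x: int, f_y: int) -> float:
    """Return logDice, assuming the form 14 + log2(2 * f_xy / (f_x + f_y)).

    f_xy: how often the two items occur together
    f_x, f_y: how often each item occurs on its own
    """
    if f_xy <= 0 or f_x <= 0 or f_y <= 0:
        raise ValueError("all frequencies must be positive")
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


# Hypothetical counts: the pair co-occurs 1 time, each item occurs 2 times.
score = log_dice(1, 2, 2)
```

    Because the score depends only on the three frequencies and not on the corpus size, values computed this way can be compared across corpora of different sizes.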
  • search attribute

    the attribute that is used for searching and for creating a word list. A word list can be built from words, lemmas, tags, etc.
  • search span

    the number of tokens on either side of the node that are checked when filtering a concordance. A search span from -5 to 5 means that the filter applies to all concordance lines that contain a match for the filter criterion within 5 tokens to the left or right of the node.
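
    The idea of a search span can be sketched as a window check around the node. This is a minimal sketch, assuming each concordance line is a plain list of tokens and the filter criterion is a single exact token; `within_span` and its parameters are hypothetical names, not how Sketch Engine implements filtering.

```python
def within_span(tokens, node_index, target, left=-5, right=5):
    """Return True if `target` occurs within the span around the node.

    The span runs from `left` to `right` token positions relative to the
    node; the node itself is not counted as a match.
    """
    start = max(0, node_index + left)
    end = min(len(tokens), node_index + right + 1)
    return any(
        i != node_index and tokens[i] == target
        for i in range(start, end)
    )


line = "the quick brown fox jumps over the lazy dog".split()
node = line.index("fox")

# With the span -5 to 5, "dog" (5 tokens to the right) falls inside the window.
keep_wide = within_span(line, node, "dog")
# With the span -2 to 2, it falls outside.
keep_narrow = within_span(line, node, "dog", left=-2, right=2)
```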
  • segment

    Segments refer to the parts into which a parallel (multilingual) corpus is divided for the purpose of alignment. Alignment means that the corpus contains information about which segment in one language is a translation of which segment in another language. Segments typically correspond to sentences, but some corpora are aligned at the paragraph or document level. The shorter the segments, the easier it is to locate the translated word or phrase in the segment.
  • simple maths [ statistics ]

    The simple maths formula is used to calculate the keyness score in Sketch Engine. This score is used to identify terms, keywords, key n-grams and key collocations. It identifies items which appear more frequently in the focus corpus than in the reference corpus. It uses relative (per million) frequencies and therefore makes it possible to contrast corpora of unequal sizes. See Simple maths.
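
    A minimal sketch of the calculation, assuming the usual published form of the score: the per-million frequency in the focus corpus plus a smoothing constant, divided by the per-million frequency in the reference corpus plus the same constant. The function names and the default constant of 1 are assumptions made for illustration; see Simple maths for the authoritative description.

```python
def per_million(count: int, corpus_size: int) -> float:
    """Convert an absolute frequency to a relative (per million) frequency."""
    return count * 1_000_000 / corpus_size


def keyness(focus_count, focus_size, ref_count, ref_size, n=1.0):
    """Return the keyness score, assuming (fpm_focus + n) / (fpm_ref + n).

    `n` is a smoothing constant; it also avoids division by zero when the
    item does not occur in the reference corpus.
    """
    fpm_focus = per_million(focus_count, focus_size)
    fpm_ref = per_million(ref_count, ref_size)
    return (fpm_focus + n) / (fpm_ref + n)


# Hypothetical counts: 50 hits in a 500,000-token focus corpus versus
# 100 hits in a 1,000,000-token reference corpus. Both are 100 per million,
# so the item is equally frequent in relative terms.
same_rate = keyness(50, 500_000, 100, 1_000_000)
```

    Scores above 1 point to items that are relatively more frequent in the focus corpus; scores near 1 point to items used at a similar rate in both.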
  • stem [ attribute ]

    A stem is the part of a word that remains without its affixes (suffixes, prefixes, etc.). Stems do not have to be valid word forms, e.g. the stem hav for the word form having, compared with the lemma have for the same word form. Stems are used instead of lemmas, or in addition to lemmas, with languages whose morphology requires it. Examples are agglutinating languages such as Turkish, Hungarian or Japanese.
  • stemming

    Stemming is the process of stripping a word of its affixes (suffixes, prefixes, etc.) so that only the stem remains. Stemming is used to detect related words with the same stem, i.e. the part of the word that does not change with case, number or tense. Word stems are available, for example, in the Portuguese corpus ptTenTen and the Turkish corpus trTenTen. The analysis is performed by tools called stemmers. Stemming is also used instead of lemmatization with agglutinating languages such as Hungarian or Turkish. See also PoS tagger, lemmatization.
  • structure

    A corpus structure refers to the segments or parts into which a corpus can be divided. Typically, a corpus is divided into sentences, paragraphs and documents, but the corpus author can introduce various other structures to allow the analysis to focus on smaller or larger parts of the corpus. For a list of common corpus structures, see Dividing a corpus into smaller parts and annotating them.
  • subcorpus

    A corpus can be subdivided into an unlimited number of parts called subcorpora. Subcorpora can be used to divide the corpus by type (fiction, newspaper), medium (spoken, written), time (e.g. by year) or any other criteria. A subcorpus can also be created from a concordance by including all concordance lines and the documents they come from. A subcorpus can be selected on the advanced tab of most tools (except word sketch differences and thesaurus). Selecting a subcorpus restricts the search or the analysis to that subcorpus only. See How to create a subcorpus.