• T-score [ statistics ]

    T-score expresses the certainty with which we can claim that there is an association between the words, i.e. that their co-occurrence is not random. The value is affected by the frequency of the whole collocation, which is why very frequent word combinations tend to reach a high T-score despite not being significant collocations. In most cases, T-score is more reliable or more useful than MI Score. See also: Concordance - collocations; Statistics in Sketch Engine. Compare: MI Score.
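
The effect of frequency on the T-score can be illustrated with a small sketch. It assumes the commonly cited formula T = (f_AB - f_A * f_B / N) / sqrt(f_AB), which this glossary entry does not state; the function name `t_score` and all counts are hypothetical.

```python
import math

def t_score(f_ab: int, f_a: int, f_b: int, n: int) -> float:
    """T-score under the ASSUMED formula (f_ab - f_a * f_b / n) / sqrt(f_ab)."""
    if f_ab <= 0:
        raise ValueError("the collocation must occur at least once")
    # Co-occurrences expected if the two words were independent of each other.
    expected = f_a * f_b / n
    return (f_ab - expected) / math.sqrt(f_ab)

# Hypothetical counts in a 1,000,000-token corpus.
# A very frequent pair that occurs only twice as often as chance predicts:
frequent = t_score(f_ab=400, f_a=20_000, f_b=10_000, n=1_000_000)
# A rare pair that occurs far more often than chance predicts:
rare = t_score(f_ab=4, f_a=20, f_b=10, n=1_000_000)

print(round(frequent, 2), round(rare, 2))
```

With these made-up counts, the frequent pair scores higher than the rare pair even though the rare pair is much more strongly associated relative to chance, which is the behaviour described above.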
  • tag [ attribute ]

    A tag (also called part-of-speech tag, POS tag or morphological tag) is a label assigned to each token in an annotated corpus to indicate its part of speech and grammatical category. The tool used to annotate a corpus is called a tagger. The collection of tags used in a corpus is called a tagset. The most frequently used tags in a corpus are listed on the corpus information page with a link to the complete tagset. Our blog post on POS tags explains how they work.
  • tagset

    A tagset (also called tag set) is the list of part-of-speech tags used in a corpus. In Sketch Engine, corpora in the same language tend to use the same tagset, but exceptions exist. To check which tagset a corpus uses, open Corpus statistics and details. See our blog post about POS tags.
  • TBL

    TBL (Tick Box Lexicography) is an application in Sketch Engine for collecting usage-example sentences to build dictionaries. Find more on the Tick Box Lexicography page.
  • term

    Term is a concept used in connection with the Keywords & Terms tool. A term is a multi-word expression (consisting of several tokens) which appears more frequently in one corpus (the focus corpus) than in another corpus (the reference corpus) and, at the same time, has the format of a term in the language. The format is defined in a term grammar, which is specific to each language. The term grammar typically focuses on identifying noun phrases. The extracted terms are typical of the content of the corpus and can be used to identify the topic of the corpus. See also: term extraction; keywords.
  • term base

    In connection with CAT tools, a term base is a database of subject-specific terminology and other lexical items which need to be translated consistently. The CAT tool uses the term base to check the consistency of translation, to look for untranslated segments, and to suggest (or automatically supply) translations of the terms from the database.
  • term extraction

    Term extraction is the process of identifying subject-specific vocabulary in a subject-specific text, usually using specialized software. In Sketch Engine, one-word and multi-word terms are found by comparing the frequency of these words and phrases with their frequency in a reference corpus.
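
The comparison with a reference corpus can be sketched as follows. The scoring function is an assumption for illustration (a smoothed ratio of per-million frequencies, in the spirit of the "simple maths" keyness score); the entry above does not specify the formula, and the names `per_million`, `keyness`, the smoothing value and all counts are hypothetical.

```python
def per_million(count: int, corpus_size: int) -> float:
    """Normalize a raw count to a frequency per million tokens."""
    return count * 1_000_000 / corpus_size

def keyness(focus_count: int, focus_size: int,
            ref_count: int, ref_size: int, smoothing: float = 1.0) -> float:
    """ASSUMED score: smoothed ratio of focus to reference frequency per million."""
    focus_fpm = per_million(focus_count, focus_size)
    ref_fpm = per_million(ref_count, ref_size)
    return (focus_fpm + smoothing) / (ref_fpm + smoothing)

# Hypothetical corpus sizes and candidate counts: (count in focus, count in reference).
focus_size, ref_size = 100_000, 10_000_000
candidates = {
    "translation memory": (50, 20),
    "term base": (30, 5),
    "last year": (10, 1500),
}

scores = {
    term: keyness(fc, focus_size, rc, ref_size)
    for term, (fc, rc) in candidates.items()
}
# Candidates that are much more frequent in the focus corpus rank first.
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

In this made-up data, the two terminology candidates score far above 1, while the general-language phrase "last year" scores below 1 because it is relatively more frequent in the reference corpus.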
  • term grammar

    A term grammar is a collection of rules written in CQL which define the lexical structures, typically noun phrases, that should be included in term extraction. The term grammar uses POS tags, which is why term extraction is only available for tagged corpora. The use of a term grammar ensures a clean term extraction result which requires very little post-editing. See also: term; keyword; Best term extraction (blog).
  • text analysis [ text-analysis ]

    Text analysis (also content analysis) is a method for analyzing texts in order to gain information from them. The result of the analysis is structured data which can be used for further processing. Sketch Engine offers a one-page automatic summary of a word's collocations with the word sketch feature. See also other text analysis tools.
  • text mining [ text-analysis ]

    Text mining is an automatic process of extracting information from text, such as the keywords of a text or its source(s). The corresponding tools in Sketch Engine are WebBootCaT, for creating corpora from the web, and keywords and terms extraction, which finds terminology in your texts. Read about other text analysis tools.
  • text type

    A text type refers to values assigned to structures (e.g. documents, paragraphs, sentences and others) inside a corpus. Text types are sometimes called metadata or headers. Text types can refer to the source (newspaper, book etc.), medium (spoken, written), time (year, century) or any other type of information about the text. Not all corpora have documents annotated for text types. Corpora can be divided into subcorpora based on text types, so that searches and other analyses are performed only on texts belonging to the selected text type.
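
As a minimal sketch of the idea, a subcorpus can be thought of as a filter over document metadata. The data layout, the field names (`source`, `medium`, `year`) and the helper functions below are hypothetical and do not reflect how Sketch Engine stores corpora.

```python
from collections import Counter

# Hypothetical mini-corpus: each document carries text type values as metadata.
documents = [
    {"source": "newspaper", "medium": "written", "year": 2019,
     "tokens": ["the", "bank", "raised", "rates"]},
    {"source": "book", "medium": "written", "year": 2005,
     "tokens": ["the", "river", "bank", "flooded"]},
    {"source": "interview", "medium": "spoken", "year": 2019,
     "tokens": ["well", "the", "bank", "called"]},
]

def subcorpus(docs, **text_types):
    """Keep only documents whose metadata matches every requested value."""
    return [d for d in docs
            if all(d.get(key) == value for key, value in text_types.items())]

def frequency(docs, word):
    """Count occurrences of a word across the given documents."""
    return sum(Counter(d["tokens"])[word] for d in docs)

written = subcorpus(documents, medium="written")
written_2019 = subcorpus(documents, medium="written", year=2019)

print(len(written), len(written_2019), frequency(written, "bank"))
```

Any analysis run on `written` only sees the documents annotated as written, which mirrors restricting a search to a selected text type.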
  • token

    A token is the smallest unit into which a corpus is divided. Typically, each word form and each punctuation mark (comma, full stop, ...) is a separate token, although don't in English consists of 2 tokens. Therefore, corpora contain more tokens than words. Spaces between words are not tokens. A text is divided into tokens by a tool called a tokenizer, which is often specific to each language.
  • tokenization

    Tokenization is the automatic process of separating text into tokens.
  • tokenizer

    A tokenizer is a tool (software) used for dividing text into tokens. A tokenizer is language-specific and takes into account the peculiarities of the language, e.g. don't in English is tokenized as two tokens. Sketch Engine contains tokenizers for many languages and also a universal tokenizer used for languages not yet supported by Sketch Engine. The universal tokenizer only recognizes whitespace characters as token boundaries, ignoring any language-specific rules. This, however, is sufficient for the use of many Sketch Engine features.
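
The difference between a whitespace-only tokenizer and a language-specific one can be sketched as below. Both functions are simplified illustrations, not Sketch Engine's actual tokenizers; in particular, splitting don't into do + n't follows a common English convention and is an assumption here, as are the function names and the regular expression.

```python
import re

def whitespace_tokenize(text: str) -> list[str]:
    """Universal-style tokenization: whitespace is the only token boundary."""
    return text.split()

# Illustrative English-like rules (ASSUMED): split off the n't clitic and
# treat every punctuation character as its own token.
TOKEN_PATTERN = re.compile(r"n't|\w+(?=n't)|\w+|[^\w\s]")

def english_like_tokenize(text: str) -> list[str]:
    """Simplified language-specific tokenization for the sketch above."""
    return TOKEN_PATTERN.findall(text)

sentence = "We don't know, do we?"
simple = whitespace_tokenize(sentence)
detailed = english_like_tokenize(sentence)

print(simple)
print(detailed)
```

The whitespace version leaves punctuation attached to words (`know,` and `we?`) and keeps don't as one token, while the rule-based version yields more tokens than there are words in the sentence.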
  • translation memory

    A translation memory is a database inside a CAT tool which holds segments of text translated in the past. The CAT tool can suggest (or automatically supply) translations based on matching text from the translation memory.
  • trends

    Trends is a feature used for diachronic analysis, i.e. for identifying how the frequency of a word (or other attribute) changes over time.
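
One simple way to quantify such a change is the slope of a straight line fitted to the frequencies over time. The sketch below uses an ordinary least-squares slope as an assumption; it is not necessarily the method Sketch Engine uses, and the function name, the words and the per-million frequencies are hypothetical.

```python
def slope(xs: list[float], ys: list[float]) -> float:
    """Least-squares slope of ys over xs (change in frequency per time step)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator

# Hypothetical frequencies per million tokens for each year.
years = [2018, 2019, 2020, 2021, 2022]
frequencies = {
    "podcast": [10.0, 14.0, 18.0, 22.0, 26.0],
    "fax": [9.0, 7.0, 6.0, 4.0, 3.0],
}

trends = {word: slope(years, freqs) for word, freqs in frequencies.items()}
for word, value in trends.items():
    direction = "rising" if value > 0 else "falling"
    print(word, round(value, 2), direction)
```

A positive slope indicates a word whose frequency grows over the period, and a negative slope one whose frequency declines.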