• T-score [ statistics ]

    T-score expresses the certainty with which we can argue that there is an association between the words, i.e. that their co-occurrence is not random. The value is affected by the frequency of the whole collocation, which is why very frequent word combinations tend to reach a high T-score despite not being significant as collocations. In most cases, T-score is more reliable or more useful than MI Score. See Concordance - collocations and Statistics in Sketch Engine. Compare MI Score.
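
    As an illustration, the sketch below computes a T-score from raw counts using a commonly cited formulation: observed co-occurrence frequency minus the frequency expected by chance, divided by the square root of the observed frequency. The function and parameter names are assumptions made for this example, and the formula should be checked against Statistics in Sketch Engine before relying on it.

```python
import math


def t_score(f_xy: int, f_x: int, f_y: int, corpus_size: int) -> float:
    """T-score of a word pair from raw counts (illustrative formulation).

    f_xy        -- how often the two words co-occur
    f_x, f_y    -- frequency of each word on its own
    corpus_size -- total number of tokens in the corpus
    """
    if f_xy <= 0 or corpus_size <= 0:
        raise ValueError("counts must be positive")
    # Co-occurrences expected if the two words were independent.
    expected = f_x * f_y / corpus_size
    return (f_xy - expected) / math.sqrt(f_xy)


if __name__ == "__main__":
    # Hypothetical counts, not taken from any real corpus.
    print(t_score(f_xy=100, f_x=1000, f_y=1000, corpus_size=1_000_000))
```

    When the expected frequency is small compared with the observed one, this value grows roughly with the square root of the co-occurrence count, which is consistent with very frequent combinations scoring high.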
  • tag [ attribute ]

    (also called morphological tag or POS tag) a label assigned to each token in an annotated corpus to indicate the part of speech and grammatical category. The tool used to annotate a corpus is called a tagger. A collection of tags used in a corpus is called a tagset. See our blog about POS tags.
  • tagset

    (also called tag set) a list of part-of-speech tags used in one corpus. In Sketch Engine, corpora in the same language tend to use the same tagset, but exceptions exist. To check the tagset used, access Corpus statistics and details. See our blog about POS tags.
  • TBL

    an application in Sketch Engine for collecting usage-example sentences to build dictionaries. Find more on the Tick Box Lexicography page.
  • term

    a keyword or multi-word expression that is more frequent in one corpus than in another and that is not a common word or phrase such as "the", "house" or "at the". Such an item is therefore characteristic of the corpus. See more on term extraction.
  • term base

    In connection with CAT tools, a term base is a database of subject-specific terminology and other lexical items which need to be translated consistently. The CAT tool uses the term base to check the consistency of translation, to look for untranslated segments, and to suggest (or automatically supply) translations of the terms from the database.
  • term extraction

    the process of identifying subject-specific vocabulary in a subject-specific text, usually with specialized software. In Sketch Engine, one-word and multi-word terms are found by comparing the frequency of these words and phrases in the text with their frequency in a reference corpus.
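
    A minimal sketch of such a frequency comparison is shown below. It scores a candidate by the smoothed ratio of its normalized (per-million) frequency in the focus text to that in the reference corpus. The function names, the smoothing value and the sample counts are assumptions for illustration; the exact scoring used by Sketch Engine is not described in this entry.

```python
def per_million(count: int, corpus_size: int) -> float:
    """Normalize a raw count to occurrences per million tokens."""
    return count * 1_000_000 / corpus_size


def keyness(focus_count: int, focus_size: int,
            ref_count: int, ref_size: int,
            smoothing: float = 1.0) -> float:
    """Smoothed ratio of normalized frequencies (focus vs. reference).

    The smoothing constant avoids division by zero for candidates that
    are absent from the reference corpus; its value is an assumption.
    """
    focus_fpm = per_million(focus_count, focus_size)
    ref_fpm = per_million(ref_count, ref_size)
    return (focus_fpm + smoothing) / (ref_fpm + smoothing)


if __name__ == "__main__":
    # Hypothetical counts: (count in focus text, count in reference corpus).
    candidates = {"term base": (50, 10), "the": (60_000, 600_000)}
    for word, (focus, ref) in candidates.items():
        print(word, keyness(focus, 1_000_000, ref, 10_000_000))
```

    With these made-up counts, the common word has the same normalized frequency in both corpora and scores 1.0, while the subject-specific candidate scores well above 1.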
  • text analysis [ text-analysis ]

    text analysis (also content analysis) is a method for analyzing texts in order to gain information from them. The result of the content analysis is structured data which can be used for further processing. Sketch Engine offers a one-page automatic summary of a word's collocations with the word sketch feature. See also other text analysis tools.
  • text mining [ text-analysis ]

    text mining is an automatic process of extracting information from text, such as the keywords of a text or its source(s). The corresponding tools in Sketch Engine are WebBootCaT, for creating corpora from the web, and keywords and terms extraction, which finds terminology in your texts. Read about other text analysis tools.
  • text type

    a text type is a term used when talking about text corpora. It refers to values assigned to structures (e.g. documents, paragraphs, sentences and others) inside a corpus. Text types are sometimes called metadata or headers. Text types can refer to the source (newspaper, book etc.), medium (spoken, written), time (year, century) or any other type of information about the text. Not all corpora have documents annotated for text types. Corpora can be divided into subcorpora based on text types, and searches and other analyses can then be restricted to texts belonging to the selected text type.
  • token

    A token is the smallest unit into which a corpus is divided. Typically, each word form and each punctuation mark (comma, dot, ...) is a separate token, but some word forms are split further, e.g. don't in English consists of 2 tokens. Therefore, corpora contain more tokens than words. Spaces between words are not tokens. A text is divided into tokens by a tool called a tokenizer, which is often specific to each language.
  • tokenization

    Tokenization is the automatic process of separating text into tokens.
  • tokenizer

    A tokenizer is a tool (software) used for dividing text into tokens. A tokenizer is language-specific and takes into account the peculiarities of the language, e.g. don't in English is tokenized as two tokens. Sketch Engine contains tokenizers for many languages and also a universal tokenizer used for languages not yet supported by Sketch Engine. The universal tokenizer only recognizes whitespace characters as token boundaries, ignoring any language-specific rules. This, however, is sufficient for the use of many Sketch Engine features.
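
    The difference between whitespace-only and rule-based tokenization can be sketched as follows. This is not the code of any Sketch Engine tokenizer: both functions and the regular expression are simplified assumptions, and the English rule covers only the n't contraction written with a plain apostrophe.

```python
import re


def whitespace_tokenize(text: str) -> list[str]:
    """Split on whitespace only, with no language-specific rules."""
    return text.split()


# Alternatives are tried left to right at each position:
#   n't            the negative clitic itself
#   \w+(?=n't\b)   the word stem in front of that clitic ("do" in "don't")
#   \w+            any other run of word characters
#   [^\w\s]        a single punctuation character
_ENGLISH_TOKEN = re.compile(r"n't\b|\w+(?=n't\b)|\w+|[^\w\s]")


def simple_english_tokenize(text: str) -> list[str]:
    """Toy rule-based tokenizer: separates punctuation and splits n't."""
    return _ENGLISH_TOKEN.findall(text)


if __name__ == "__main__":
    sentence = "I don't know."
    print(whitespace_tokenize(sentence))
    print(simple_english_tokenize(sentence))
```

    For the sample sentence, the whitespace version keeps don't and know. as single tokens, whereas the rule-based version yields do, n't, know and the final dot as separate tokens.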
  • translation memory

    A translation memory is a database inside a CAT tool which holds segments of text translated in the past. The CAT tool can suggest (or automatically supply) translations based on matching text from the translation memory.
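
    The matching step can be sketched with a similarity lookup over stored segments. The example uses SequenceMatcher from the Python standard library; the threshold, the in-memory dictionary and the sample segments are assumptions for illustration and do not reflect how any particular CAT tool is implemented.

```python
from difflib import SequenceMatcher
from typing import Optional


def suggest_translation(segment: str,
                        memory: dict[str, str],
                        threshold: float = 0.75) -> Optional[tuple[str, str, float]]:
    """Return (stored source, stored translation, similarity) for the
    most similar stored segment, or None if nothing reaches the threshold."""
    best = None
    best_score = 0.0
    for source, target in memory.items():
        # Character-level similarity between 0.0 and 1.0.
        score = SequenceMatcher(None, segment, source).ratio()
        if score > best_score:
            best, best_score = (source, target, score), score
    return best if best_score >= threshold else None


if __name__ == "__main__":
    # Hypothetical English-German memory.
    memory = {
        "Save the file.": "Speichern Sie die Datei.",
        "Close the window.": "Schließen Sie das Fenster.",
    }
    print(suggest_translation("Save the files.", memory))
    print(suggest_translation("Restart the computer now.", memory, threshold=0.9))
```

    An exact match has a similarity of 1.0 and could be supplied automatically, while lower-scoring matches would be offered only as suggestions.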
  • trends

    Trends is a feature used for diachronic analysis, i.e. for identifying how the frequency of a word (or of other attributes) changes over time.
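
    One simple way to quantify such a change is to normalize the word's count in each period by the size of that period's texts and fit a straight line through the results. The sketch below does this with an ordinary least-squares slope; it is an assumption chosen for illustration, since the method used by the Trends feature is not described in this entry, and all names and figures are hypothetical.

```python
def frequency_trend(counts_by_year: dict[int, int],
                    tokens_by_year: dict[int, int]) -> float:
    """Least-squares slope of per-million frequency against year.

    A positive value means the word becomes more frequent over time,
    a negative value means it becomes less frequent.
    """
    years = sorted(counts_by_year)
    if len(years) < 2:
        raise ValueError("at least two years are needed")
    # Relative frequency per million tokens for each year.
    freqs = [counts_by_year[y] * 1_000_000 / tokens_by_year[y] for y in years]
    mean_year = sum(years) / len(years)
    mean_freq = sum(freqs) / len(freqs)
    covariance = sum((y - mean_year) * (f - mean_freq) for y, f in zip(years, freqs))
    variance = sum((y - mean_year) ** 2 for y in years)
    return covariance / variance


if __name__ == "__main__":
    # Hypothetical yearly counts and yearly corpus sizes.
    counts = {2000: 10, 2001: 20, 2002: 30}
    sizes = {2000: 1_000_000, 2001: 1_000_000, 2002: 1_000_000}
    print(frequency_trend(counts, sizes))
```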