• T-score [ statistics ]

    T-score expresses the certainty with which we can argue that there is an association between the words, i.e. that their co-occurrence is not random. The value is affected by the frequency of the whole collocation, which is why very frequent word combinations tend to reach a high T-score despite not being significant collocations. In most cases, T-score is more useful than MI score. However, both scores are affected by corpus size, which makes them less useful when working with modern multi-billion-word corpora. This is why Sketch Engine prefers the logDice score in most situations, especially in word sketches.
    See: Concordance - collocations; Statistics in Sketch Engine. Compare: MI Score, logDice.
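    The contrast between the three scores can be sketched as follows. This is a minimal illustration using the commonly cited formulas for T-score, MI score and logDice; the function names and all counts are hypothetical, and the exact implementation in Sketch Engine is not confirmed by this entry. Holding the counts fixed while changing only the corpus size N is a deliberate simplification to show which scores depend on N.

```python
import math

def t_score(f_ab, f_a, f_b, n):
    """(observed - expected) / sqrt(observed), where expected = f_a * f_b / n."""
    expected = f_a * f_b / n
    return (f_ab - expected) / math.sqrt(f_ab)

def mi_score(f_ab, f_a, f_b, n):
    """log2 of observed co-occurrence frequency over expected frequency."""
    return math.log2(f_ab * n / (f_a * f_b))

def log_dice(f_ab, f_a, f_b):
    """14 + log2 of the Dice coefficient; the corpus size n is not used."""
    return 14 + math.log2(2 * f_ab / (f_a + f_b))

# Hypothetical counts: the pair occurs 100 times, the words 1,000 and 2,000 times.
f_ab, f_a, f_b = 100, 1_000, 2_000

for n in (1_000_000, 10_000_000):
    print(n,
          round(t_score(f_ab, f_a, f_b, n), 2),
          round(mi_score(f_ab, f_a, f_b, n), 2),
          round(log_dice(f_ab, f_a, f_b), 2))
# 1000000 9.8 5.64 10.09
# 10000000 9.98 8.97 10.09
```

    With identical counts, T-score and MI score both shift when N changes, while logDice stays the same.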
  • tag [ attribute ]

    (also called part-of-speech tag, POS tag or morphological tag) is a label assigned to each token in an annotated corpus to indicate the part of speech and grammatical category. The tool used to annotate a corpus is called a tagger. A collection of tags used in a corpus is called a tagset. The most frequently used tags in a corpus are listed on the corpus information page with a link to the complete tagset. Our blog post on POS tags explains how they work.
  • tagset

    (also called tag set) is a list of part-of-speech tags used in one corpus. In Sketch Engine, corpora in the same language tend to use the same tagset but exceptions exist. To check the tagset used, access Corpus statistics and details. See our blog about POS tags.
  • TBL

    TBL (Tick Box Lexicography) is an application in Sketch Engine for collecting usage-example sentences to build dictionaries. Find more on the Tick Box Lexicography page.
  • term

    Term is a concept used in connection with the Keywords & Terms tool. A term is a multi-word expression (consisting of several tokens) which appears more frequently in one corpus (the focus corpus) than in another corpus (the reference corpus) and which, at the same time, has the format of a term in the language. The format is defined in a term grammar, which is specific to each language. The term grammar typically focuses on identifying noun phrases. The extracted terms are typical of the content of the corpus and can be used to identify the topic of the corpus.
    See also: term extraction, keywords.
  • term base

    In connection with CAT tools, a term base is a database of subject-specific terminology and other lexical items which need to be translated consistently. The CAT tool uses the term base to check the consistency of translation, to look for untranslated segments, and to suggest (or automatically supply) translations of the terms from the database.
  • term extraction

    Term extraction is the process of identifying subject-specific vocabulary in a subject-specific text, usually using specialized software. In Sketch Engine, one-word and multi-word terms are found by comparing the frequency of these words and phrases with their frequency in a reference corpus.
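    The comparison with a reference corpus can be sketched as a smoothed ratio of normalized frequencies. This is only one common way to score such a comparison; the formula, the smoothing constant, the function names and the counts below are assumptions for illustration and are not stated in this entry.

```python
def per_million(count, corpus_size):
    """Normalize an absolute frequency to occurrences per million tokens."""
    return count * 1_000_000 / corpus_size

def keyness(focus_count, focus_size, ref_count, ref_size, smoothing=1.0):
    """Ratio of normalized frequencies in the focus and reference corpus.

    The smoothing constant (an assumed parameter) avoids division by zero
    for items that are missing from the reference corpus.
    """
    focus_fpm = per_million(focus_count, focus_size)
    ref_fpm = per_million(ref_count, ref_size)
    return (focus_fpm + smoothing) / (ref_fpm + smoothing)

# Hypothetical counts: 50 hits in a 100,000-token focus corpus,
# 20 hits in a 10,000,000-token reference corpus.
score = keyness(50, 100_000, 20, 10_000_000)
print(round(score, 1))  # 167.0
```

    A score well above 1 suggests the item is typical of the focus corpus; candidates would still have to match the term grammar to count as terms.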
  • term grammar

    A term grammar is a set of rules written in CQL which define the lexical structures, typically noun phrases, that should be included in term extraction. The lexical structures are defined using POS tags. The use of a term grammar ensures a clean term extraction result which requires very little post-editing.
    For illustration only: the term grammar for English defines terms as sequences of nouns and adjectives (noun+noun+noun, adjective+noun, adjective+adjective+noun etc.). The term grammar for Spanish contains rules such as noun+adjective, noun+de+noun, adjective+noun+de+noun etc. The actual rules are much more complex: they include prepositions, articles and optional words, and they define which words must or must not appear before or after a lexical structure for it to be considered a term. They also check adjective-noun agreement in number, gender or case and other relevant grammatical categories.
    See also: term, keyword, Best term extraction (blog), word sketch grammar.
  • text analysis [ text-analysis ]

    text analysis (also content analysis or text analytics) is a method for analyzing (usually unstructured) text in order to extract information. The result of text analysis is structured data. In addition to the traditional tools, Sketch Engine also offers some unique features. The traditional tools consist of various frequency-based statistics:
    • word or lemma frequency, part-of-speech frequency via the wordlist tool
    • bigram, trigram, n-gram frequencies via the n-gram tool
    • absolute frequencies, relative frequencies, document frequencies, average reduced frequency (ARF)
    • phrase and multiword frequency via the concordance
    Advanced techniques include: The tools and statistics can be combined depending on the task involved. See also other text analysis tools.
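    Several of the frequency-based statistics listed above can be sketched in a few lines. The two tiny tokenized documents and the variable names are made up for illustration; real corpora are processed with the wordlist, n-gram and concordance tools rather than by hand.

```python
from collections import Counter

# Two hypothetical, already tokenized documents.
docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat"],
]

tokens = [tok for doc in docs for tok in doc]
total = len(tokens)

# Absolute frequency: raw number of occurrences of each word form.
absolute = Counter(tokens)

# Relative frequency: occurrences per million tokens.
relative = {word: count * 1_000_000 / total for word, count in absolute.items()}

# Document frequency: number of documents containing the word at least once.
doc_freq = Counter(word for doc in docs for word in set(doc))

# Bigram frequency: adjacent token pairs, counted within each document.
bigrams = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))

print(absolute["the"], doc_freq["the"], doc_freq["cat"])  # 3 2 1
print(bigrams[("the", "cat")], bigrams[("sat", "the")])   # 1 0
```

    Bigrams are counted per document here so that no pair spans a document boundary.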
  • text mining [ text-analysis ]

    text mining is an automatic process of extracting information from text, such as keywords of a text or its source(s). The corresponding tools in Sketch Engine are WebBootCaT for creating corpora from the web or keywords and terms extraction which finds terminology in your texts. Read about other text analysis tools.
  • text type

    [We follow Biber (1989) in using text type as a generic term for the many ways in which a text might be classified.]
    A text type refers to values assigned to structures (e.g. documents, paragraphs, sentences and others) inside a corpus. Text types are sometimes called metadata or header information. Text types can refer to the source (newspaper, book etc.), medium (spoken, written), time (year, century) or any other type of information about the text. Not all corpora have documents annotated for text types.
    Corpora can be divided into subcorpora based on text types, and searches and other analyses can be performed only on texts belonging to the selected text type. The text type selector is used to limit the analysis to only certain text types. See: Text type (metadata) selector.
    Users can include metadata in their corpora. If the metadata are in the required format, they will be converted to text types and will appear in the text type selector. See: Conventions for inserting metadata manually.
  • TMX – Translation Memory eXchange format

    Translation Memory eXchange (TMX) is an XML format used for creating parallel corpora in Sketch Engine. It is the standard format for translation memories (TM). See more about Setting up parallel corpora in Sketch Engine. The following example of a TMX document (from Wikipedia) shows the structures required for creating parallel corpora, <tu>, <tuv> and <seg>:
    <tmx version="1.4">
      <header
        creationtool="XYZTool" creationtoolversion="1.01-023"
        datatype="PlainText" segtype="sentence"
        adminlang="en-us" srclang="en"
        o-tmf="ABCTransMem"/>
      <body>
        <tu>
          <tuv xml:lang="en">
            <seg>Hello world!</seg>
          </tuv>
          <tuv xml:lang="fr">
            <seg>Bonjour tout le monde!</seg>
          </tuv>
        </tu>
      </body>
    </tmx>
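    To show how the <tu>, <tuv> and <seg> structures carry the aligned text, here is a minimal sketch that reads a TMX document with Python's standard library. The function name aligned_segments is hypothetical, and the sketch only illustrates the structure of the format; it does not show how Sketch Engine itself processes TMX files.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

TMX = """<tmx version="1.4">
  <header creationtool="XYZTool" creationtoolversion="1.01-023"
    datatype="PlainText" segtype="sentence"
    adminlang="en-us" srclang="en" o-tmf="ABCTransMem"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Hello world!</seg></tuv>
      <tuv xml:lang="fr"><seg>Bonjour tout le monde!</seg></tuv>
    </tu>
  </body>
</tmx>"""

def aligned_segments(tmx_text):
    """Return one dict per translation unit, mapping language code to segment text."""
    root = ET.fromstring(tmx_text)
    units = []
    for tu in root.iter("tu"):
        unit = {}
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            # itertext() also collects text nested inside inline markup in <seg>.
            unit[tuv.get(XML_LANG)] = "".join(seg.itertext()) if seg is not None else ""
        units.append(unit)
    return units

print(aligned_segments(TMX))
# [{'en': 'Hello world!', 'fr': 'Bonjour tout le monde!'}]
```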
  • token

    A token is the smallest unit that a corpus consists of. A token normally refers to:
    • a word form: going, trees, Mary, twenty-five
    • punctuation: comma, dot, question mark, quotes…
    • digit: 50,000…
    • abbreviations, product names: 3M, i600, XP, FB…
    • anything else between spaces
    There are two types of tokens: words and nonwords. Corpora contain more tokens than words. Spaces are not tokens. A text is divided into tokens by a tool called a tokenizer, which is often specific to each language.
    Exceptions
    These general principles apply to all languages but some language-specific features may be handled differently. Here are some examples:
    • don't in English consists of 2 tokens: do + n't.
    • Verbs with pronominal clitics in Spanish, Italian, French, Portuguese etc. count as one token (Spanish dárselo is 1 token, even though it consists of dar + se + lo)
    How to check tokenization
    The wordlist works on tokens only. Search for the token using the wordlist. If it is found, it is one token. If it is not found, it is not one token.
    See also: word, nonword, word form.
  • tokenization

    Tokenization is the automatic process of separating text into tokens. This process is performed by tools called tokenizers.
  • tokenizer

    A tokenizer is a tool (software) used for dividing text into tokens. A tokenizer is language-specific and takes into account the peculiarities of the language, e.g. don't in English is tokenized as two tokens. Sketch Engine contains tokenizers for many languages and also a universal tokenizer used for languages not yet supported by Sketch Engine. The universal tokenizer only recognizes whitespace characters as token boundaries, ignoring any language-specific rules. This, however, is sufficient for the use of many Sketch Engine features.
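    The difference between whitespace-only and language-specific tokenization can be sketched with two toy functions. Both are hypothetical illustrations, not the tokenizers used in Sketch Engine: the first splits on whitespace only, the second adds two English-like rules (split off n't, separate punctuation).

```python
import re

def whitespace_tokenize(text):
    """Treat only whitespace as a token boundary."""
    return text.split()

# Alternatives are tried in order at each position:
#   1. a word stem directly followed by n't ("do" in "don't")
#   2. the clitic n't itself
#   3. a word, optionally hyphenated ("twenty-five")
#   4. any single punctuation character
TOY_ENGLISH = re.compile(r"\w+(?=n't)|n't|\w+(?:-\w+)*|[^\w\s]")

def toy_english_tokenize(text):
    """Apply the toy English rules described above."""
    return TOY_ENGLISH.findall(text)

sentence = "I don't know, Mary."
print(whitespace_tokenize(sentence))
# ['I', "don't", 'know,', 'Mary.']
print(toy_english_tokenize(sentence))
# ['I', 'do', "n't", 'know', ',', 'Mary', '.']
```

    The whitespace version leaves punctuation attached to the neighbouring word, which is the kind of language-specific detail it ignores.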
  • translation memory

    A translation memory is a database inside a CAT tool which holds segments of text translated in the past. The CAT tool can suggest (or automatically supply) translations based on matching text from the translation memory.
  • trends

    Trends is a feature used for diachronic analysis, i.e. for identifying how the frequency of a word (or other attribute) changes over time.
  • Type/token ratio (TTR)

    The type/token ratio, often shortened to TTR, is a simple measure of lexical diversity. It can only be interpreted by comparing it to the TTR of a different text (corpus). The corpus with the higher TTR contains a greater variety of words than the other corpus. In other words, its authors use more different words, or a richer vocabulary, than the authors of the texts in the other corpus.
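    As a minimal sketch, TTR is the number of distinct tokens (types) divided by the total number of tokens. The function name and sample texts are hypothetical, and lowercasing before counting is an assumption; whether case is folded depends on how the comparison is set up.

```python
def type_token_ratio(tokens):
    """Number of distinct tokens (types) divided by the total number of tokens."""
    if not tokens:
        raise ValueError("TTR is undefined for an empty text")
    lowered = [tok.lower() for tok in tokens]  # assumption: case is ignored
    return len(set(lowered)) / len(lowered)

text_a = "The cat sat on the mat".split()  # 5 types, 6 tokens
text_b = "the the the cat".split()         # 2 types, 4 tokens

print(round(type_token_ratio(text_a), 2))  # 0.83
print(round(type_token_ratio(text_b), 2))  # 0.5
```

    Here text_a has the higher TTR, so it shows the greater variety of words of the two.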