• ARF – Average Reduced Frequency [ statistics ]

    a modified frequency that prevents the result from being excessively influenced by one part of the corpus (e.g. one or more documents) that contains a high concentration of the token. If the token is evenly distributed across the corpus, ARF and frequency per million will be comparable. See also: ARF definition.
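
    As an illustration only, the sketch below follows the commonly published ARF formula: the corpus is treated as a circle, v is the average distance between hits (corpus size divided by frequency), and each distance between neighbouring hits is capped at v before summing. The function name and the position-list input are assumptions made for this example, not part of any Sketch Engine interface.

```python
def average_reduced_frequency(positions, corpus_size):
    """ARF of a token, given the 0-based token positions of its hits."""
    f = len(positions)
    if f == 0:
        return 0.0
    v = corpus_size / f  # average distance between hits
    pos = sorted(positions)
    # Cyclic distances: the gap before the first hit wraps around from the last hit.
    distances = [pos[0] + corpus_size - pos[-1]]
    distances += [b - a for a, b in zip(pos, pos[1:])]
    # Each distance contributes at most v, so a tight cluster counts for little.
    return sum(min(d, v) for d in distances) / v


evenly_spread = average_reduced_frequency([0, 25, 50, 75], corpus_size=100)
clustered = average_reduced_frequency([10, 11, 12, 13], corpus_size=100)
print(evenly_spread)  # 4.0, the same as the plain frequency
print(clustered)      # 1.12, close to a single occurrence
```

    Both tokens have a plain frequency of 4, but the clustered one is reduced to roughly 1 because all of its hits sit in one small part of the corpus.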
  • document frequency [ statistics ]

    The document frequency is the number of documents in which a word or phrase appears, irrespective of how many times it appears in each. For example, if the corpus has 100 documents and only 2 of them contain the word city (document 1 contains 17 instances and document 2 contains 6 instances), the document frequency of city is 2. Neither the total number of documents in the corpus nor the total number of occurrences of the word affects this figure. The document frequency can be better suited for comparison when the corpus contains a small number of documents with an extremely high frequency of particular words. See also: frequency, frequency per million, ARF, Statistics used in Sketch Engine.
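
    The city example can be reproduced with a few lines of code. This is a minimal sketch that assumes each document is already available as a list of tokens; the function name and the toy corpus are hypothetical.

```python
def document_frequency(documents, word):
    """Number of documents that contain `word` at least once."""
    return sum(1 for tokens in documents if word in tokens)


# Toy corpus of 100 documents: two mention "city", the other 98 do not.
corpus = [["city"] * 17 + ["hall"], ["city"] * 6] + [["village"]] * 98

df = document_frequency(corpus, "city")
total_hits = sum(tokens.count("city") for tokens in corpus)
print(df)          # 2
print(total_hits)  # 23
```

    The 23 hits come from only 2 documents, which is exactly the situation in which the document frequency tells a different story from the plain frequency.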
  • freq/mill – frequency per million [ statistics ]

    the number of occurrences (hits) of an item normalised per million tokens, also called i.p.m. (instances per million). It is used to compare frequencies between corpora of different sizes. Formula: number of hits / corpus size in millions of tokens = frequency per million. Example: A token found 10 times in a corpus of 1 million tokens will have a frequency per million equal to 10. A token found 100 times in a corpus of 100 million tokens will have a frequency per million equal to 1. The second token is less frequent. See also: Statistics in Sketch Engine, Frequency per million, Average Reduced Frequency.
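
    The two examples above can be checked with a short calculation. The function name is chosen for this illustration only.

```python
def frequency_per_million(hits, corpus_size_tokens):
    """Hits normalised to a corpus of one million tokens."""
    return hits * 1_000_000 / corpus_size_tokens


print(frequency_per_million(10, 1_000_000))     # 10.0
print(frequency_per_million(100, 100_000_000))  # 1.0
```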
  • frequency [ statistics ]

    Frequency (also absolute frequency) refers to the number of occurrences or hits. If a word, phrase, tag etc. has a frequency of 10, it means it was found 10 times or it exists 10 times. It is an absolute figure. It is not calculated using a specific formula. Compare: frequency per million. See also: ARF, document frequency, Statistics used in Sketch Engine.
  • likelihood [ statistics ]

    a function of the parameters of a statistical model. It plays a key role in statistical inference and is the basis for the log-likelihood function. See: Statistics in Sketch Engine.
  • log-likelihood [ statistics ]

    one of the functions used to compute statistics in Sketch Engine. It is an association measure based on the likelihood function and is used in tests for significance (see the log-likelihood calculator and more details).
  • logDice [ statistics ]

    a statistical measure for identifying collocations. It expresses the typicality of the co-occurrence of the node and the collocate. It is used in the word sketch feature and also when computing collocations from a concordance. It is based only on the frequency of the node, the frequency of the collocate and the frequency of the whole collocation. logDice is not affected by the size of the corpus and can therefore be used to compare scores between different corpora. logDice is the preferred option when working with large corpora. See also: logDice in Statistics used in Sketch Engine, A Lexicographer-Friendly Association Score (paper), T-score, MI score.
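
    For illustration, the sketch below uses the logDice formula as it is usually published, 14 + log2(2 * f_xy / (f_x + f_y)), where f_xy is the frequency of the whole collocation and f_x and f_y are the frequencies of the node and the collocate. The function and parameter names are assumptions of this example.

```python
import math


def log_dice(f_xy, f_x, f_y):
    """logDice from the collocation frequency and the two word frequencies."""
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


# The words always occur together: the theoretical maximum.
print(log_dice(f_xy=100, f_x=100, f_y=100))  # 14.0
# The pair accounts for half of the occurrences of each word.
print(log_dice(f_xy=50, f_x=100, f_y=100))   # 13.0
```

    The corpus size does not appear anywhere in the calculation, which is why scores from corpora of different sizes can be compared.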
  • MI Score [ statistics ]

    The Mutual Information score expresses the extent to which words co-occur compared to the number of times they appear separately. MI Score is strongly affected by frequency: low-frequency words tend to reach a high MI score, which may be misleading. This is why Sketch Engine allows setting a frequency limit so that low-frequency words can be excluded from the calculation. In most cases, T-score is more useful than MI score. See: Concordance - Collocations, Statistics in Sketch Engine. Compare: T-score.
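
    The low-frequency bias is easy to see in a small calculation. The sketch assumes the usual textbook form of the score, log2(f_xy * N / (f_x * f_y)), with N the corpus size in tokens; the names and the counts are invented for this example.

```python
import math


def mi_score(f_xy, f_x, f_y, corpus_size):
    """Mutual Information: observed co-occurrences against chance expectation."""
    return math.log2(f_xy * corpus_size / (f_x * f_y))


N = 1_000_000
# Two words that each occur once, together: a very high but unreliable score.
rare_pair = mi_score(f_xy=1, f_x=1, f_y=1, corpus_size=N)
# A well-attested pair of frequent words: a much lower score.
frequent_pair = mi_score(f_xy=1000, f_x=10_000, f_y=10_000, corpus_size=N)
print(round(rare_pair, 2))      # about 19.93
print(round(frequent_pair, 2))  # about 3.32
```

    A single chance co-occurrence of two rare words outranks a pair seen 1,000 times, which is the reason for applying a minimum frequency limit.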
  • minimum sensitivity [ statistics ]

    a statistical measure similar to logDice, defined as the minimum of the following two numbers:

    • the number of co-occurrences divided by the frequency of the collocate
    • the number of co-occurrences divided by the frequency of the node word

    The minimum sensitivity number grows with a high number of co-occurrences and falls with a high number of occurrences of the individual words (node word or collocate).
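
    The definition above translates directly into code. The function and parameter names are chosen for this sketch.

```python
def minimum_sensitivity(f_xy, f_node, f_collocate):
    """Smaller of the two ratios: co-occurrences per collocate and per node word."""
    return min(f_xy / f_collocate, f_xy / f_node)


# 20 co-occurrences, node word found 100 times, collocate found 40 times.
print(minimum_sensitivity(f_xy=20, f_node=100, f_collocate=40))  # 0.2
```

    The more frequent of the two words determines the result, because it yields the smaller ratio.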

  • overall score [ statistics ]

    the score of a relation in word sketches, based on logDice. The score is displayed in the header of each column of the relation.
  • salience [ statistics ]

    a statistical measure of the significance of a specific token in the given context. It is measured with logDice. For more information, see section 3 of Statistics used in Sketch Engine.
  • simple math [ statistics ]

    the simple formula used for the computation and identification of terms and keywords. see Simple math.
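
    As a rough illustration, the sketch below assumes the formula as it is usually presented for keyword extraction: the frequency per million in the focus corpus plus a smoothing constant n, divided by the frequency per million in the reference corpus plus the same n. The names, the default n = 1 and the sample figures are assumptions of this example.

```python
def simple_math_score(fpm_focus, fpm_reference, n=1.0):
    """Keyness score: smoothed ratio of two frequencies per million."""
    return (fpm_focus + n) / (fpm_reference + n)


# 10 per million in the focus corpus against 1 per million in the reference corpus.
print(simple_math_score(10, 1))         # 5.5
# A larger n damps the score, so rare words are rewarded less.
print(simple_math_score(10, 1, n=100))  # about 1.09
```

    Adding n also avoids division by zero when the item is absent from the reference corpus.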
  • T-score [ statistics ]

    T-score expresses the certainty with which we can argue that there is an association between the words, i.e. that their co-occurrence is not random. The value is affected by the frequency of the whole collocation, which is why very frequent word combinations tend to reach a high T-score despite not being significant collocations. In most cases, T-score is more reliable or more useful than MI Score. See: Concordance - collocations, Statistics in Sketch Engine. Compare: MI Score.
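
    The effect of the collocation frequency can be seen in a short calculation. The sketch assumes the usual textbook form of the score, (f_xy - f_x * f_y / N) / sqrt(f_xy), with N the corpus size in tokens; the names and the counts are invented for this example.

```python
import math


def t_score(f_xy, f_x, f_y, corpus_size):
    """T-score: observed minus expected co-occurrences, scaled by sqrt(observed)."""
    expected = f_x * f_y / corpus_size
    return (f_xy - expected) / math.sqrt(f_xy)


N = 1_000_000
# A frequent combination scores high.
print(round(t_score(f_xy=100, f_x=1000, f_y=1000, corpus_size=N), 2))  # 9.9
# A single co-occurrence of two rare words can never exceed 1.
print(t_score(f_xy=1, f_x=1, f_y=1, corpus_size=N) < 1)  # True
```

    Unlike the MI Score, the result grows with the frequency of the whole collocation, so rare pairs stay at the bottom of the list.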