Glossary

ALDF – Average Logarithmic Distance Frequency is a modified frequency that prevents the result from being excessively influenced by one part of the corpus (e.g. one or more documents) that contains a high concentration of the token. If the token is evenly distributed across the corpus, ALDF and absolute frequency will be similar or [...]
ARF – Average Reduced Frequency is a modified frequency that prevents the result from being excessively influenced by one part of the corpus (e.g. one or more documents) that contains a high concentration of the token. If the token is evenly distributed across the corpus, ARF and absolute frequency will be similar or identical. [...]
document frequency (docf)
Document frequency is the number of documents in which a token or phrase appears. If the corpus has 100 documents and 2 of them contain the word city (document number 7 contains 17 instances of city and document number 31 contains 6 instances), the document frequency of city is [...]
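The difference between document frequency and absolute frequency can be sketched in a few lines of Python. This is an illustration only: the function names and the toy corpus are assumptions, not Sketch Engine code, and tokens are matched exactly, without lemmatization.

```python
def document_frequency(documents, token):
    """Number of documents that contain `token` at least once."""
    return sum(1 for doc in documents if token in doc)


def absolute_frequency(documents, token):
    """Total number of occurrences of `token` across all documents."""
    return sum(doc.count(token) for doc in documents)


# Hypothetical toy corpus: each document is a list of tokens.
docs = [
    ["the", "city", "is", "a", "big", "city"],
    ["a", "small", "village"],
    ["city", "life"],
]

docf = document_frequency(docs, "city")   # counts documents, not hits
freq = absolute_frequency(docs, "city")   # counts every hit
```

Here the token occurs three times in total but in only two documents, so the two figures differ.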
relative frequency, frequency per million (also called freq/mill in the interface) is the number of occurrences of an item per million tokens, also called i.p.m. (instances per million). It is used to compare frequencies between corpora (or datasets) of different sizes.
Formula
number of hits / corpus size in millions of tokens = [...]
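As a minimal sketch of the formula above (number of hits divided by the corpus size in millions of tokens), assuming hypothetical counts and a hypothetical function name:

```python
def frequency_per_million(hits, corpus_size_tokens):
    """Relative frequency: hits per million tokens of the corpus."""
    if corpus_size_tokens <= 0:
        raise ValueError("corpus size must be positive")
    return hits / (corpus_size_tokens / 1_000_000)


# Hypothetical figures: the same item in two corpora of different sizes.
small_corpus = frequency_per_million(50, 2_000_000)      # 50 hits in 2M tokens
large_corpus = frequency_per_million(500, 100_000_000)   # 500 hits in 100M tokens
```

Although the larger corpus has ten times as many hits, the item is relatively more frequent in the smaller one (25 versus 5 per million), which is the kind of comparison absolute frequency cannot support.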
frequency
Frequency (also absolute frequency) refers to the number of occurrences or hits. If a word, phrase, tag etc. has a frequency of 10, it was found 10 times, i.e. it occurs 10 times. It is an absolute figure and is not calculated using a specific formula. Compare frequency per [...]
likelihood is a function of the parameters of a statistical model. It plays a key role in statistical inference and is the basis for the log-likelihood function. See Statistics in Sketch Engine.
log-likelihood is one of the functions used in the statistics computed by Sketch Engine. It is an association measure based on the likelihood function and is used in tests of significance (see the log-likelihood calculator and more details).
logDice is a statistical measure for identifying co-occurrence (two items appearing together). Sketch Engine uses it to identify collocations. It expresses the typicality (or strength) of the collocation. It is used in the word sketch feature and also when computing collocations from a [...]
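The entry above does not state the formula. As an illustration only, the sketch below assumes the logDice formula as commonly published, 14 + log2(2 * f_xy / (f_x + f_y)), where f_xy is the number of co-occurrences and f_x and f_y are the frequencies of the two items. The function name and the counts are hypothetical and this is not Sketch Engine code.

```python
import math


def log_dice(f_xy, f_x, f_y):
    """logDice from the co-occurrence count and the two item frequencies.

    Assumes the commonly published formula; corpus size is not an input.
    """
    if f_xy <= 0 or f_x <= 0 or f_y <= 0:
        raise ValueError("all counts must be positive")
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


# Two items that only ever occur together reach the theoretical maximum.
perfect = log_dice(10, 10, 10)
# A weaker pairing (hypothetical counts) scores lower.
weaker = log_dice(100, 1000, 3000)
```

Under this assumed formula the score depends only on the three counts, so values computed on corpora of different sizes remain comparable.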
MI Score
The Mutual Information score expresses the extent to which words co-occur compared to the number of times they appear separately. MI Score is strongly affected by frequency: low-frequency words tend to reach a high MI score, which may be misleading. This is why Sketch Engine [...]
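The low-frequency bias described above can be seen in a small sketch. It assumes the commonly used formula MI = log2(f_xy * N / (f_x * f_y)), where N is the corpus size in tokens; the entry itself does not give the formula, and the function name and all counts are hypothetical.

```python
import math


def mi_score(f_xy, f_x, f_y, corpus_size):
    """Mutual Information score under the assumed formula log2(f_xy*N/(f_x*f_y))."""
    if min(f_xy, f_x, f_y, corpus_size) <= 0:
        raise ValueError("all counts must be positive")
    return math.log2(f_xy * corpus_size / (f_x * f_y))


N = 1_000_000  # hypothetical corpus size in tokens

# A pair seen only twice, where both words are themselves very rare.
rare_pair = mi_score(2, 2, 2, N)
# A well-attested pair of frequent words.
frequent_pair = mi_score(1000, 5000, 8000, N)
```

With these counts the pair that occurs only twice scores far higher than the pair attested a thousand times, which is why a high MI score on its own can be misleading.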
minimum sensitivity
A statistical measure similar to logDice. It is the minimum of the following two numbers:
- the number of co-occurrences divided by the frequency of the collocate
- the number of co-occurrences divided by the frequency of the node word
The minimum sensitivity number grows with a [...]
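The two ratios listed above can be sketched directly in Python. The function name and the counts are hypothetical; only the min-of-two-ratios definition comes from the entry.

```python
def minimum_sensitivity(cooccurrences, f_node, f_collocate):
    """Minimum of co-occurrences/f_collocate and co-occurrences/f_node."""
    if f_node <= 0 or f_collocate <= 0:
        raise ValueError("frequencies must be positive")
    return min(cooccurrences / f_collocate, cooccurrences / f_node)


# Hypothetical counts: 30 co-occurrences, node word seen 120 times,
# collocate seen 60 times. The ratios are 30/60 and 30/120.
score = minimum_sensitivity(30, f_node=120, f_collocate=60)
```

Because the smaller ratio is taken, the score is limited by whichever of the two words is more frequent relative to the number of co-occurrences.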
overall score
The score of the relation based on logDice in word sketches. The score is displayed in the header of each column of the relation.
relative text type frequency (also called Relative density in the interface)
Relative text type frequency compares the frequency in a specific text type to the frequency in the whole corpus. It shows how typical the word(s) are of a specific text type, e.g. of the spoken part of the corpus or of a particular website from [...]
salience is a statistical measure of the significance of a specific token in a given context. It is measured using logDice. For more information, see section 3 of Statistics used in Sketch Engine.
simple maths
The simple maths formula is used to calculate the keyness score in Sketch Engine. This score is used to identify terms, keywords, key n-grams and key collocations. It identifies items which appear more frequently in the focus corpus than in the reference corpus. It uses relative (per [...]
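The entry above is truncated before the formula. As a hedged illustration, the sketch below assumes the simple maths formula as commonly published: (fpm_focus + n) / (fpm_ref + n), where both frequencies are per million tokens and n is a smoothing parameter. The function name, the default n = 1 and the figures are assumptions, not values taken from this glossary.

```python
def keyness_score(fpm_focus, fpm_reference, smoothing=1.0):
    """Keyness under the assumed simple maths formula.

    fpm_focus and fpm_reference are frequencies per million tokens;
    `smoothing` avoids division by zero for items absent from the reference.
    """
    if smoothing <= 0:
        raise ValueError("smoothing must be positive")
    return (fpm_focus + smoothing) / (fpm_reference + smoothing)


# Hypothetical item: 99 per million in the focus corpus, 9 in the reference.
default_score = keyness_score(99, 9)
# A larger smoothing value pulls the score towards 1.
smoothed_score = keyness_score(99, 9, smoothing=100)
# An item missing from the reference corpus still gets a finite score.
unseen_score = keyness_score(4, 0)
```

A score above 1 means the item is relatively more frequent in the focus corpus than in the reference corpus.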
T-score
T-score expresses the certainty with which we can claim that there is an association between the words, i.e. that their co-occurrence is not random. The value is affected by the frequency of the whole collocation, which is why very frequent word combinations tend to reach a high T-score despite [...]
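The effect of collocation frequency can be illustrated with a small sketch. It assumes the commonly used formula T = (f_xy - f_x * f_y / N) / sqrt(f_xy), where N is the corpus size in tokens; the entry does not state the formula, and the function name and counts are hypothetical.

```python
import math


def t_score(f_xy, f_x, f_y, corpus_size):
    """T-score under the assumed formula (observed - expected) / sqrt(observed)."""
    if min(f_xy, f_x, f_y, corpus_size) <= 0:
        raise ValueError("all counts must be positive")
    expected = f_x * f_y / corpus_size  # co-occurrences expected by chance
    return (f_xy - expected) / math.sqrt(f_xy)


N = 1_000_000  # hypothetical corpus size in tokens

# A frequent combination: 100 co-occurrences against 2 expected by chance.
frequent = t_score(100, 1000, 2000, N)
# A rare combination that never occurs apart: 4 co-occurrences.
rare = t_score(4, 4, 4, N)
```

With these counts the frequent combination scores well above the rare one, even though the rare pair never occurs separately.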
Type/token ratio (TTR)
The type/token ratio, often shortened to TTR, is a simple measure of lexical diversity. It can only be interpreted by comparing it to the TTR of a different text (corpus). The corpus with the higher TTR contains a greater variety of words than the other corpus. In other words, the authors use more [...]
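A minimal sketch of the ratio, assuming the usual definition (number of distinct word forms divided by the total number of tokens). The function name, the whitespace tokenization and the case-sensitive comparison are simplifying assumptions.

```python
def type_token_ratio(tokens):
    """Number of distinct tokens (types) divided by the total number of tokens."""
    if not tokens:
        raise ValueError("token list must not be empty")
    return len(set(tokens)) / len(tokens)


# Hypothetical texts, tokenized by splitting on whitespace.
varied = "the cat sat on the mat".split()        # 5 types, 6 tokens
repetitive = "the the the the".split()           # 1 type, 4 tokens

ttr_varied = type_token_ratio(varied)
ttr_repetitive = type_token_ratio(repetitive)
```

The first text has the higher ratio, so under this measure it is the more lexically diverse of the two.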