Average Logarithmic Distance Frequency (ALDF) is a type of corrected frequency (also called adjusted frequency or modified frequency) that can be displayed in the results of word lists, n-grams, keywords and term extraction.

This modified frequency indicates whether a token is distributed evenly throughout the whole corpus or whether its occurrences cluster close to each other, e.g. in one or only a few documents of the corpus. The closer ALDF is to the absolute frequency, the more evenly the token is distributed. If the absolute frequency and ALDF are identical, the token is perfectly evenly distributed throughout the whole corpus.

The calculation of Average Logarithmic Distance Frequency prevents the results from being excessively influenced by a high concentration of a token in only one or a few small parts of the corpus. In contrast to Average Reduced Frequency (ARF), ALDF is based on the logarithms of the distances between occurrences of the token.

ALDF in practice

In practice, the modified frequency ALDF can be applied in creating dictionaries based on corpora. This is because using absolute frequency alone for selecting words as dictionary entries is insufficient.

For instance, in the British National Corpus (BNC), the word list of lemmas contains the word “hon.”, an abbreviated form of the word “honourable”, in the 1036th position, while “honourable” itself stands in the 6430th position of this list. The absolute frequency of “hon.” is 10,546, whereas its ALDF is only 58. (The low value reflects the fact that the word occurs in only 91 of the 4,054 documents in the BNC.) By contrast, the word “honourable” has a lower absolute frequency of 862, but its ALDF is 232, which indicates that it is distributed widely throughout the corpus (345 documents) and thus throughout the language.

Some approaches also use the terms word dispersion and word commonness to describe the (un)evenness of word distribution across the whole corpus.

Definition

ALDF is based on the Average Logarithmic Distance (ALD), which describes the typical distance between occurrences of the searched token. In other words, ALD tells us at which interval the token should be expected, i.e. that on average every so-many-th token in the corpus should be the token you are searching for. This concept is related to perplexity, a measure of how well a probability distribution predicts a sample.

ALDF, on the other hand, is derived from the average of the logarithmic distances between occurrences of the searched token, and it can be related to entropy (information theory), i.e. to the average level of “uncertainty” about where in the corpus the token occurs.
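To make these analogies concrete, the normalized distances can be read as a probability distribution. The short derivation below is our own sketch based on the ALD and ALDF formulas given in the next paragraphs; it is not part of the original definition. Writing $p_i = d_i / N$ (so that the $p_i$ sum to 1),

$$ALD = \sum_{i=1}^{f} p_i \log_{10}(N p_i) = \log_{10} N - H(p), \qquad ALDF = \frac{N}{10^{ALD}} = 10^{H(p)},$$

where $H(p) = -\sum_{i=1}^{f} p_i \log_{10} p_i$ is the base-10 entropy of the distance distribution. ALDF is therefore the perplexity of that distribution: it reaches its maximum, the absolute frequency $f$, exactly when all distances are equal, and it decreases as the occurrences cluster together.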

The formula for ALD is

$$ALD = \sum_{i=1}^{f} \frac{d_i}{N} \cdot \log_{10} d_i$$

where

N – size of the corpus (the total number of tokens)

f – frequency of the token (the number of its occurrences)

d_i – distance between consecutive occurrences of the token (the f distances together cover the whole corpus, so they sum to N)

The formula for ALDF is based on ALD:

$$ALDF = \frac{N}{10^{ALD}}$$
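To illustrate the two formulas, here is a minimal Python sketch of the computation. The function name, its signature and the exact distance convention are our assumptions for illustration (the sketch treats the corpus as a cycle so that the f distances sum to N, which reproduces the example in the next section); it is not Sketch Engine's implementation.

```python
import math

def aldf(positions, corpus_size):
    """Return (ALD, ALDF) for one token.

    positions   -- sorted 1-based positions of the token's occurrences
    corpus_size -- total number of tokens in the corpus (N)

    Sketch only: distances are measured between consecutive occurrences,
    with the first distance wrapping around the corpus boundary, so that
    all f distances together sum to N.
    """
    f = len(positions)
    if f == 0:
        return 0.0, 0.0
    # d_1 wraps from the last occurrence across the corpus boundary
    distances = [positions[0] + corpus_size - positions[-1]] + [
        positions[i] - positions[i - 1] for i in range(1, f)
    ]
    # ALD = sum of (d_i / N) * log10(d_i) over all occurrences
    ald = sum(d / corpus_size * math.log10(d) for d in distances)
    # ALDF = N / 10^ALD
    return ald, corpus_size / 10 ** ald
```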

Example

Let’s imagine a corpus with 80 tokens in which the word you are looking for has an absolute frequency of 5. The word is distributed evenly within the corpus, which means it occurs at every 16th position. The table below represents the corpus as 80 numbered cells, one per token. Our word occurs at the 16th, 32nd, 48th, 64th, and 80th positions in the corpus (the positions marked with an asterisk in the table).

 1   2   3   4   5   6   7   8
 9  10  11  12  13  14  15 *16
17  18  19  20  21  22  23  24
25  26  27  28  29  30  31 *32
33  34  35  36  37  38  39  40
41  42  43  44  45  46  47 *48
49  50  51  52  53  54  55  56
57  58  59  60  61  62  63 *64
65  66  67  68  69  70  71  72
73  74  75  76  77  78  79 *80

In this case, the formula for ALDF gives

$$ALDF = \frac{80}{10^{ALD}} = \frac{80}{10^{\log_{10} 16}} = \frac{80}{16} = 5$$

because ALD is calculated as follows:

$$ALD = \sum_{i=1}^{5} \frac{16}{80} \cdot \log_{10} 16 = 5 \cdot \frac{16}{80} \cdot \log_{10} 16 = \log_{10} 16 \approx 1.204$$

The absolute frequency 5 is therefore the same as the ALDF (80 / 16 = 5), which reflects the perfectly even distribution of the word.
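These numbers can be reproduced with the aldf sketch from the Definition section (again, a hypothetical helper written for this page, not a Sketch Engine function):

```python
# evenly spread occurrences: ALDF equals the absolute frequency
print(aldf([16, 32, 48, 64, 80], 80))   # (1.204..., 5.0)

# the same five occurrences squeezed together: ALDF drops to about 1.3
print(aldf([16, 17, 18, 19, 20], 80))   # (1.786..., 1.306...)
```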

References

Savický, Petr, and Jaroslava Hlaváčová. 2002. Measures of word commonness. Journal of Quantitative Linguistics 9: 215–231.

Hlaváčová, Jaroslava, and Pavel Rychlý. 1999. Dispersion of words in a language corpus. In Text, Speech and Dialogue, 321–324. Berlin, Heidelberg: Springer.

Hlaváčová, Jaroslava. New Approach to Frequency Dictionaries — Czech Example.
