Average Reduced Frequency (ARF) is a modified frequency measure that 'discounts' multiple occurrences of a word that lie close to each other, e.g. in the same document.

Definition

Let the corpus be split into k non-overlapping parts of equal length, where k is the corpus frequency of the word and the first part starts at a chosen position in the corpus. The reduced frequency of the word is the number of parts that contain it. There are as many reduced frequencies as there are tokens in a part, since the start of the first part can be shifted along the corpus one position at a time until it reproduces a split already seen. The ARF is the average of these reduced frequencies.
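The definition can be turned into code almost literally. The following is a minimal sketch, not a reference implementation: the function names are hypothetical, positions are assumed to be 0-based token offsets, the corpus length is assumed to be divisible by k, and a part that runs past the end of the corpus is assumed to wrap around to its beginning.

```python
def reduced_frequency(positions, corpus_len, offset):
    """Count the parts that contain the word for one particular split."""
    k = len(positions)           # corpus frequency of the word
    part_len = corpus_len // k   # assumes corpus_len is divisible by k
    # Shift every position by the offset; the modulo makes the last part
    # wrap around to the beginning of the corpus.
    parts = {((p - offset) % corpus_len) // part_len for p in positions}
    return len(parts)


def arf_by_definition(positions, corpus_len):
    """Average the reduced frequencies over all distinct splits."""
    k = len(positions)
    if k == 0 or corpus_len % k != 0:
        raise ValueError("this sketch needs k > 0 and corpus_len divisible by k")
    part_len = corpus_len // k
    rfs = [reduced_frequency(positions, corpus_len, o) for o in range(part_len)]
    return sum(rfs) / part_len
```

For a toy corpus of 6 tokens with the word at positions 0, 1 and 3, the parts have length 2, the two possible splits give reduced frequencies 2 and 3, and `arf_by_definition([0, 1, 3], 6)` returns 2.5.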

Example

Consider a corpus of 60 tokens and a word that occurs 5 times in it (at positions 0, 11, 13, 16 and 56, counting from 0). In the diagrams below, + marks an occurrence of the word and - any other token.

+----------+-+--+---------------------------------------+---

According to the definition, the corpus is split into 5 parts of length 60/5 = 12. Part boundaries are marked with |.

|+----------+|-+--+-------|------------|------------|--------+---

The number of parts containing the word is the reduced frequency (RF) of the word, which is 3 in this case.

|+----------+|-+--+-------|------------|------------|--------+---
|  present   |  present   |  missing   |  missing   |  present     

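Shifting the start of the first part by one token at a time, with the last part wrapping around to the beginning of the corpus, gives 12 ways to split the corpus into 5 parts of the same length, so 12 reduced frequencies are calculated.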

|+----------+|-+--+-------|------------|------------|--------+--- RF = 3
+|----------+-|+--+--------|------------|------------|-------+--- RF = 3
+-|---------+-+|--+---------|------------|------------|------+--- RF = 3
+--|--------+-+-|-+----------|------------|------------|-----+--- RF = 3
+---|-------+-+--|+-----------|------------|------------|----+--- RF = 3
+----|------+-+--+|------------|------------|------------|---+--- RF = 2
+-----|-----+-+--+-|------------|------------|------------|--+--- RF = 2
+------|----+-+--+--|------------|------------|------------|-+--- RF = 2
+-------|---+-+--+---|------------|------------|------------|+--- RF = 2
+--------|--+-+--+----|------------|------------|-----------+|--- RF = 3
+---------|-+-+--+-----|------------|------------|----------+-|-- RF = 3
+----------|+-+--+------|------------|------------|---------+--|- RF = 3

The average RF is ARF = (3 + 3 + 3 + 3 + 3 + 2 + 2 + 2 + 2 + 3 + 3 + 3) / 12 = 32 / 12 ≈ 2.67. Unlike the raw corpus frequency of the word, which is 5, the ARF discounts occurrences that lie close together and estimates that the word's frequency would be only about 2.67 in a homogeneous corpus.
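The same value can be obtained without enumerating the splits. ARF is usually given in the literature in a closed form based on the distances between consecutive occurrences: with v = N / f (corpus length divided by the word's frequency), ARF = (1 / v) × the sum of min(d, v) over all distances d, where the distance for the first occurrence wraps around from the last one. The sketch below assumes 0-based positions; the function name is hypothetical.

```python
def arf_from_distances(positions, corpus_len):
    """ARF via the distance-based closed form (a sketch, cyclic corpus)."""
    f = len(positions)
    if f == 0:
        return 0.0
    v = corpus_len / f  # length of one part
    pos = sorted(positions)
    # Distance from each occurrence back to the previous one; the first
    # occurrence is measured from the last one, wrapping around the corpus.
    gaps = [pos[0] + corpus_len - pos[-1]]
    gaps += [b - a for a, b in zip(pos, pos[1:])]
    return sum(min(d, v) for d in gaps) / v


# Example from the text: distances 4, 11, 2, 3, 40 are capped at v = 12,
# giving 4 + 11 + 2 + 3 + 12 = 32, and 32 / 12 is about 2.67.
print(round(arf_from_distances([0, 11, 13, 16, 56], 60), 2))  # prints 2.67
```

Each distance d contributes min(d, v) because a part boundary falls between two consecutive occurrences in exactly that many of the v possible splits. Unlike the enumeration of splits, this form does not require the corpus length to be divisible by the frequency.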

References

Savický, Petr, and Jaroslava Hlaváčová. Measures of word commonness. Journal of Quantitative Linguistics, 9, pp. 215–231, 2002.

Hlaváčová, Jaroslava, and Pavel Rychlý. Dispersion of words in a language corpus. In Text, Speech and Dialogue, pp. 321–324. Springer Berlin Heidelberg, 1999.

Hlaváčová, Jaroslava. New Approach to Frequency Dictionaries — Czech Example.