Simple maths is the keyness score used in Sketch Engine to identify keywords, terms, key n-grams and key word sketch collocations. Simple maths compares the frequencies in the focus corpus with the frequencies in the reference corpus. Alternatively, two subcorpora in the same corpus or in different corpora can be used.
The N value (the smoothing parameter in the formula below) makes the score prefer more frequent or less frequent items.
A higher N value shifts the focus towards higher-frequency (more common) words, whereas a lower N value shifts it towards lower-frequency (rarer) words. The value should be changed by orders of magnitude, e.g. 0.1, 1, 10, 100, 1000, 10000. Smaller changes rarely produce any noticeable effect.
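The statistic is a variation on “word W is so-and-so times more frequent in corpus X than in corpus Y”. The formula is:
$$\mathrm{score} = \frac{\mathrm{fpm}_{\mathrm{focus}} + N}{\mathrm{fpm}_{\mathrm{ref}} + N}$$
where
$\mathrm{fpm}_{\mathrm{focus}}$ is the normalized (per million) frequency of the word in the focus corpus,
$\mathrm{fpm}_{\mathrm{ref}}$ is the normalized (per million) frequency of the word in the reference corpus,
$N$ is the smoothing parameter ($N = 1$ is the default value).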
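As a minimal sketch, the score can be computed as the ratio of the two smoothed per-million frequencies, (focus + N) / (reference + N), and the effect of N can be seen by varying it in orders of magnitude. The function name `simple_maths`, the helper `top_keyword`, and the two words with their frequencies are hypothetical and chosen only for illustration; this is not Sketch Engine's own code.

```python
def simple_maths(fpm_focus: float, fpm_ref: float, n: float = 1.0) -> float:
    """Ratio of smoothed per-million frequencies (focus over reference)."""
    if n <= 0:
        # Assumption of this sketch: a positive N avoids division by zero
        # for words that do not occur in the reference corpus.
        raise ValueError("n must be positive")
    return (fpm_focus + n) / (fpm_ref + n)


# Hypothetical per-million frequencies: (focus corpus, reference corpus).
words = {
    "rare_word": (5.0, 0.1),          # rare, but 50x more frequent in focus
    "common_word": (2000.0, 1000.0),  # common, only 2x more frequent in focus
}


def top_keyword(n: float) -> str:
    """Return the word with the highest score for a given N."""
    return max(words, key=lambda w: simple_maths(*words[w], n=n))


for n in (0.1, 1, 10, 100, 1000):
    scores = {w: round(simple_maths(*f, n=n), 2) for w, f in words.items()}
    print(n, scores, "->", top_keyword(n))
```

With these made-up numbers the rare word ranks first for N = 0.1 and N = 1, while the common word ranks first from N = 10 upwards, which matches the behaviour described above.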
Example
Your focus corpus (BNC): 112,289,776 tokens
Frequency of the lemma (shard) in the corpus: 35
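Relative frequency: 35 × 1,000,000 / 112,289,776 ≈ 0.31 per million
Selected reference corpus (ukWaC): 1,559,716,979 tokens
Frequency of the lemma (shard) in the corpus: 263
Relative frequency: 263 × 1,000,000 / 1,559,716,979 ≈ 0.17 per million
Keyness score (with the default N = 1): (0.31 + 1) / (0.17 + 1) ≈ 1.12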
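The example can be reproduced from the raw counts with a short sketch that first normalizes each frequency to per million tokens and then applies the ratio with N = 1. The helper names `per_million` and `keyness` are assumptions made for this illustration.

```python
def per_million(count: int, corpus_size: int) -> float:
    """Normalize a raw frequency to occurrences per million tokens."""
    return count * 1_000_000 / corpus_size


def keyness(focus_count: int, focus_size: int,
            ref_count: int, ref_size: int, n: float = 1.0) -> float:
    """Smoothed ratio of per-million frequencies, focus over reference."""
    fpm_focus = per_million(focus_count, focus_size)
    fpm_ref = per_million(ref_count, ref_size)
    return (fpm_focus + n) / (fpm_ref + n)


# Counts for the lemma "shard" taken from the example above.
fpm_bnc = per_million(35, 112_289_776)
fpm_ukwac = per_million(263, 1_559_716_979)
score = keyness(35, 112_289_776, 263, 1_559_716_979)

print(f"BNC:   {fpm_bnc:.2f} per million")    # 0.31
print(f"ukWaC: {fpm_ukwac:.2f} per million")  # 0.17
print(f"score: {score:.2f}")                  # 1.12
```

A score only slightly above 1 means the lemma is barely more typical of the focus corpus than of the reference corpus once the smoothing is applied.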
For more details see:
Adam Kilgarriff. Simple maths for keywords. In Proceedings of Corpus Linguistics Conference CL2009, Mahlberg, M., González-Díaz, V. & Smith, C. (eds.), University of Liverpool, UK, July 2009.
Statistic used in Sketch Engine (Chapter 5). Lexical Computing Ltd., 8 July 2015.