Simple maths is a method for identifying the keywords of one corpus in comparison with another. It includes a parameter that allows the user to shift the focus towards either higher- or lower-frequency words.

Generally, a higher value of the Simple maths parameter (100, 1000, …) focuses on higher-frequency (more common) words, whereas a lower value (1, 0.1, …) focuses on lower-frequency (rarer) words.

The statistic used for keywords is a variation on “word W is so-and-so times more frequent in corpus X than in corpus Y”. The keyness score of a word is calculated according to the following formula:

$\frac{fpm_{\mathrm{focus}} + N}{fpm_{\mathrm{ref}} + N}$

where

$fpm_{\mathrm{focus}}$ is the normalized (per million) frequency of the word in the focus corpus,

$fpm_{\mathrm{ref}}$ is the normalized (per million) frequency of the word in the reference corpus,

$N$ is the so-called smoothing parameter ($N = 1$ is the default value).
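The formula can be sketched as a short function; the function name `simple_maths_score` and the example frequencies below are illustrative, not part of Sketch Engine's implementation:

```python
def simple_maths_score(fpm_focus: float, fpm_ref: float, n: float = 1.0) -> float:
    """Simple maths keyness: (fpm_focus + N) / (fpm_ref + N)."""
    return (fpm_focus + n) / (fpm_ref + n)

# A common word and a rare word, each more frequent in the focus corpus:
common = simple_maths_score(1000.0, 500.0, n=1.0)  # 1001/501, roughly 2.0
rare = simple_maths_score(10.0, 1.0, n=1.0)        # 11/2 = 5.5

# With the default N = 1 the rare word outscores the common one;
# raising N shifts the ranking towards the common word.
print(simple_maths_score(10.0, 1.0, n=100.0)
      < simple_maths_score(1000.0, 500.0, n=100.0))  # → True
```

This illustrates the role of $N$ described above: a large $N$ dominates small per-million frequencies, so rare words can no longer achieve extreme ratios.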

### Example

Your focus corpus (BNC): 112,289,776 tokens
Frequency of the lemma (shard) in the corpus: 35

Relative frequency

$fpm_{\mathrm{focus}} = \frac{number~of~hits \cdot 1{,}000{,}000}{corpus~size} = \frac{35 \cdot 1{,}000{,}000}{112{,}289{,}776} = 0.3117$

Selected reference corpus (ukWaC): 1,559,716,979 tokens
Frequency of the lemma (shard) in the corpus: 263

Relative frequency

$fpm_{\mathrm{ref}} = \frac{number~of~hits \cdot 1{,}000{,}000}{corpus~size} = \frac{263 \cdot 1{,}000{,}000}{1{,}559{,}716{,}979} = 0.1686$

Keyness score

$Score = \frac{fpm_{\mathrm{focus}} + N}{fpm_{\mathrm{ref}} + N} = \frac{0.3117 + 1}{0.1686 + 1} = 1.1224$
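The worked example can be checked numerically from the raw counts given above (a sketch; the helper `fpm` is an illustrative name, not Sketch Engine code):

```python
def fpm(hits: int, corpus_size: int) -> float:
    """Normalized frequency: hits per million tokens."""
    return hits * 1_000_000 / corpus_size

fpm_focus = fpm(35, 112_289_776)      # BNC: rounds to 0.3117
fpm_ref = fpm(263, 1_559_716_979)     # ukWaC: rounds to 0.1686

n = 1.0  # default smoothing parameter
score = (fpm_focus + n) / (fpm_ref + n)
print(round(score, 4))                # → 1.1224
```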

#### For more details see:

Adam Kilgarriff. Simple maths for keywords. In Proceedings of Corpus Linguistics Conference CL2009, Mahlberg, M., González-Díaz, V. & Smith, C. (eds.), University of Liverpool, UK, July 2009.

Statistic used in Sketch Engine (Chapter 5). Lexical Computing Ltd., 8 July 2015.