Sketch Engine is the ultimate tool to explore how language works. Its algorithms analyze authentic texts of billions of words (text corpora) to identify instantly what is typical in language and what is rare, unusual, or emerging.

Sketch Engine is used by linguists, lexicographers, translators, students, and teachers. Its functions are based on mathematical and statistical computations which enable users to accurately search and filter queries in language corpora.

Statistics used in Sketch Engine

1 General reference

Adam Kilgarriff, Vít Baisa, Jan Bušta, Miloš Jakubíček, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý, Vít Suchomel (2014): The Sketch Engine: ten years on. In Lexicography 1(1): 7–36. DOI: 10.1007/s40607-014-0009-9. ISSN 2197-4292

2 Conventions

This document describes the statistics used in the Sketch Engine system. The following conventions apply unless specified otherwise:

N – corpus size,

f_A – number of occurrences of the keyword in the whole corpus (the size of the concordance),

f_B – number of occurrences of the collocate in the whole corpus,

f_{AB} – number of occurrences of the collocate in the concordance (number of co-occurrences)

2.1 With grammatical relations

Terminology follows Dekang Lin, ACL-COLING 1998: “Automatic Retrieval and Clustering of Similar Words.”
We count frequencies for triples of a first word connected by a specific grammatical relation to a second word, written (word1, gramrel, word2)

||w_1, R, w_2|| – number of occurrences of the triple,

||w_1, R, \ast|| – number of occurrences of the first word in the grammatical relation with any second word,

||\ast, \ast, w_2|| – number of occurrences of the second word in any grammatical relation with any first word,

||\ast, \ast, \ast|| – number of occurrences of any first word in any grammatical relation with any second word: that is, the total number of triples found in the corpus.

3 Word Sketches

Until September 2006 we used a version of MI-Score (see also MI Score) modified to give greater weight to the frequency of the collocation. This association score was defined as:

AScore(w_1, R, w_2) = \log\frac{||w_1, R, w_2|| \cdot ||\ast, \ast, \ast||}{||w_1, R, \ast|| \cdot ||\ast, \ast, w_2||} \cdot \log(||w_1, R, w_2|| + 1)

Since September 2006, noting the scale-dependency of AScore and recent relevant research, including Curran (2004), “From Distributional to Semantic Similarity” (PhD thesis, University of Edinburgh), we changed the statistic to logDice, based on the Dice coefficient:


Dice(f_A, f_B) = \frac{2\frac{f_A}{N}\frac{f_B}{N}}{\frac{f_A}{N} + \frac{f_B}{N}} \simeq \frac{2\frac{f_{AB}}{N}}{\frac{f_A}{N} + \frac{f_B}{N}} = \frac{2f_{AB}}{f_A + f_B}

14 + \log_2 Dice\Big(\frac{||w_1,R,w_2||}{||w_1,R,\ast||}, \frac{||w_1,R,w_2||}{||\ast,\ast,w_2||}\Big) = 14 + \log_2 \frac{2 \cdot ||w_1, R, w_2||}{||w_1,R,\ast|| + ||\ast,\ast,w_2||}

For more information on logDice, see: Rychlý, P. (2008). A lexicographer-friendly association score. In Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN, pp. 6–9.

Since June 2015 (word sketch format 4, Manatee version 2.125) the indices were modified so that the score is (more correctly) computed as follows:

logDice general word sketch score (applies in all cases except those listed below)

14 + \log_2 Dice\Big(\frac{||w_1,R,w_2||}{||w_1,R,\ast||}, \frac{||w_1,R,w_2||}{||\ast,R,w_2||}\Big) = 14 + \log_2 \frac{2 \cdot ||w_1, R, w_2||}{||w_1,R,\ast|| + ||\ast,R,w_2||}

score for word sketch triples of UNARY grammatical relations


score for a given grammatical relation R as such


score for word sketch display with unified grammatical relations

14 + \log_2 Dice\Big(\frac{||w_1,R,w_2||}{f_{w_1}}, \frac{||w_1,R,w_2||}{f_{w_2}}\Big) = 14 + \log_2 \frac{2 \cdot ||w_1, R, w_2||}{f_{w_1} + f_{w_2}}

For example, the score of management in the word sketch of team (as a noun) in the BNC corpus is 9.31, computed as: 14 + \log_2 \frac{2 \cdot 433}{13919 + 8314}

Here 433 is the number of co-occurrences for the relation “management as modifier of team”; 13919 is the number of hits for the CQL query lc [ws("team-n", "modifiers of \"%w\"", ".*")]; and 8314 is the number of hits for the CQL query lc [ws(".*", "modifiers of \"%w\"", "management-n")].
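The arithmetic in this example can be rechecked with a minimal Python sketch. The function name `log_dice` and its parameter names are illustrative choices made here; they are not part of Sketch Engine or Manatee.

```python
import math


def log_dice(f_triple: int, f_head: int, f_coll: int) -> float:
    """14 + log2(2 * ||w1,R,w2|| / (||w1,R,*|| + ||*,R,w2||))."""
    if f_triple <= 0:
        raise ValueError("the triple must occur at least once")
    return 14 + math.log2(2 * f_triple / (f_head + f_coll))


# Counts quoted in the text: "management" as a modifier of "team" in the BNC.
score = log_dice(f_triple=433, f_head=13919, f_coll=8314)
print(f"{score:.3f}")
```

The printed value is roughly 9.318, in line with the 9.31 quoted in the example.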

4 Thesaurus

To compute a similarity score between word w_1 and word w_2, we compare w_1 and w_2’s word sketches in this way:

  • find all the overlaps, i.e. where w_1 and w_2 share a collocation in the same grammatical relation,
    e.g. (beer/wine, OBJECT_OF, drink), where the association score > 0,
  • let ws_{w1} and ws_{w2} be the sets of all word sketch triples (headword, relation, collocation) for w_1
    and w_2, respectively, where the association score > 0,
  • let ctx(w_1) = \{(r, c) \mid (w_1, r, c) \in ws_{w1}\},
  • let AS_i be the association score of a word sketch triple (since September 2006, logDice is used),
  • then the distance between w_1 and w_2 is computed as:

Dist(w_1, w_2) = \frac{\sum_{(r,c) \in ctx(w_1) \cap ctx(w_2)} AS_{(w_1,r,c)} + AS_{(w_2,r,c)} - (AS_{(w_1,r,c)} - AS_{(w_2,r,c)})^2/50}{\sum_{i \in ws_{w1}} AS_i + \sum_{i \in ws_{w2}} AS_i}

The term (AS_i - AS_j)^2/50 is subtracted to give less weight to shared triples that are far more salient with w_1 than with w_2, or vice versa. We find that this contributes to more readily interpretable results, where words of similar frequency are more often identified as near neighbours of each other.
The constant 50 can be changed using the -k option of the mkthes command.
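The computation described above can be sketched in Python as follows. This is an illustration only, not the mkthes implementation: the function name `similarity`, the dictionary representation of a word sketch, and all scores in the example are assumptions made for this sketch.

```python
from typing import Dict, Tuple

Context = Tuple[str, str]  # (grammatical relation, collocate)


def similarity(ws1: Dict[Context, float], ws2: Dict[Context, float],
               k: float = 50.0) -> float:
    """Score from the formula above; k plays the role of the constant 50."""
    # Keep only triples with a positive association score.
    pos1 = {ctx: s for ctx, s in ws1.items() if s > 0}
    pos2 = {ctx: s for ctx, s in ws2.items() if s > 0}
    denominator = sum(pos1.values()) + sum(pos2.values())
    if denominator == 0:
        return 0.0
    # Contexts (relation, collocate) shared by both words.
    shared = pos1.keys() & pos2.keys()
    numerator = sum(
        pos1[c] + pos2[c] - (pos1[c] - pos2[c]) ** 2 / k for c in shared
    )
    return numerator / denominator


# Hypothetical logDice scores, invented for illustration only.
beer = {("object_of", "drink"): 9.0, ("modifier", "cold"): 6.0}
wine = {("object_of", "drink"): 8.0, ("modifier", "red"): 7.0}
print(similarity(beer, wine))
```

With these invented scores only the (object_of, drink) context is shared, so the numerator is 9 + 8 - 1/50 = 16.98 and the denominator is 30. A word compared with itself scores 1, and two words with no shared context score 0.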

5 Key words, key terms, comparing corpora

Key words are words typical of a focus corpus (a corpus we are interested in) in contrast to a reference corpus (usually a general corpus in the same language as the focus corpus).
The keyness score of a word is calculated according to the following formula:

\frac{f_{pm_{focus}} + n}{f_{pm_{ref}} + n}

where f_{pm_{focus}} is the normalized (per million) frequency of the word in the focus corpus, f_{pm_{ref}} is the normalized (per million) frequency of the word in the reference corpus, and n is the simple math (smoothing) parameter (n = 1 is the default value).
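A minimal Python sketch of the keyness formula follows. The function names and the corpus sizes and counts in the example are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def per_million(count: int, corpus_size: int) -> float:
    """Normalize a raw frequency to occurrences per million tokens."""
    return count * 1_000_000 / corpus_size


def keyness(fpm_focus: float, fpm_ref: float, n: float = 1.0) -> float:
    """Keyness score: (fpm_focus + n) / (fpm_ref + n)."""
    return (fpm_focus + n) / (fpm_ref + n)


# Hypothetical word: 50 hits in a 2M-token focus corpus (25 per million),
# 100 hits in a 100M-token reference corpus (1 per million).
fpm_focus = per_million(50, 2_000_000)
fpm_ref = per_million(100, 100_000_000)

print(keyness(fpm_focus, fpm_ref))          # default n = 1
print(keyness(fpm_focus, fpm_ref, n=100))   # larger n
print(keyness(fpm_focus, 0.0))              # word absent from the reference corpus
```

With n = 1 the score is (25 + 1) / (1 + 1) = 13. A larger n pulls the score towards 1, and because n is positive the score stays defined even when the word does not occur in the reference corpus.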

The top key words reflect the domain of the focus corpus very well and can be used to explore differences between corpora in Sketch Engine as shown in Kilgarriff: “Getting to know your corpus”. In Proceedings of Text, Speech and Dialogue 2012, Lecture Notes in Computer Science. Springer, 2012.

Key terms are multi-word noun phrases typical of a corpus. They are defined using term definition rules (similarly to word sketch relations). The keyness score for terms is computed in the same way as for words, except that the corpus frequencies of whole term phrases are used.

6 Other statistics

These are the statistics offered under the “collocations” function accessible from the concordance window; these statistics do not involve grammatical relations.


T-score

\frac{f_{AB} - \frac{f_A f_B}{N}}{\sqrt{f_{AB}}}
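As a rough illustration of the T-score formula, the Python sketch below evaluates it for one set of counts. The function name and the counts are hypothetical; the symbols follow the conventions in section 2.

```python
import math


def t_score(f_ab: int, f_a: int, f_b: int, n: int) -> float:
    """(f_AB - f_A * f_B / N) / sqrt(f_AB)."""
    if f_ab <= 0:
        raise ValueError("f_AB must be positive")
    expected = f_a * f_b / n  # co-occurrences expected by chance
    return (f_ab - expected) / math.sqrt(f_ab)


# Hypothetical counts: N = 1,000,000, f_A = 1000, f_B = 500, f_AB = 100.
print(t_score(f_ab=100, f_a=1000, f_b=500, n=1_000_000))
```

Here the expected co-occurrence count is 0.5, so the score is (100 - 0.5) / 10 = 9.95.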


MI Score

\log_2 \frac{f_{AB} N}{f_A f_B}

Church and Hanks, Word Association Norms, Mutual Information, and Lexicography, in Computational Linguistics, 16(1):22-29, 1990


MI3

\log_2 \frac{f^3_{AB} N}{f_A f_B}

Oakes, Statistics for Corpus Linguistics, 1998
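The two mutual-information formulas above, the MI Score and the cubed-frequency variant commonly known as MI3, differ only in the power of f_AB. The Python sketch below makes that difference concrete; the function names and counts are hypothetical.

```python
import math


def mi_score(f_ab: int, f_a: int, f_b: int, n: int) -> float:
    """log2(f_AB * N / (f_A * f_B))."""
    return math.log2(f_ab * n / (f_a * f_b))


def mi3_score(f_ab: int, f_a: int, f_b: int, n: int) -> float:
    """log2(f_AB^3 * N / (f_A * f_B))."""
    return math.log2(f_ab ** 3 * n / (f_a * f_b))


# Hypothetical counts: N = 1,000,000, f_A = 1000, f_B = 500, f_AB = 100.
counts = dict(f_ab=100, f_a=1000, f_b=500, n=1_000_000)
mi = mi_score(**counts)    # log2(200)
mi3 = mi3_score(**counts)  # log2(2,000,000)
print(mi, mi3, mi3 - mi)
```

The difference between the two scores is always 2 * log2(f_AB), so the cubed variant favours more frequent collocations.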


log-likelihood

2 \cdot (xlx(f_{AB}) + xlx(f_A - f_{AB}) + xlx(f_B - f_{AB}) + xlx(N) + xlx(N + f_{AB} - f_A - f_B) - xlx(f_A) - xlx(f_B) - xlx(N - f_A) - xlx(N - f_B))

where xlx(f) is f \ln(f)

Dunning, Accurate Methods for the Statistics of Surprise and Coincidence, in Computational Linguistics, 19:1 1993
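The log-likelihood formula above can be sketched in Python as shown below. The function names are illustrative, and treating xlx(0) as 0 is an assumption of this sketch, since the text does not say how zero counts are handled.

```python
import math


def xlx(f: float) -> float:
    """f * ln(f); xlx(0) is taken to be 0 (an assumption of this sketch)."""
    return f * math.log(f) if f > 0 else 0.0


def log_likelihood(f_ab: int, f_a: int, f_b: int, n: int) -> float:
    """Log-likelihood score built from the nine xlx terms in the formula."""
    return 2 * (
        xlx(f_ab) + xlx(f_a - f_ab) + xlx(f_b - f_ab) + xlx(n)
        + xlx(n + f_ab - f_a - f_b)
        - xlx(f_a) - xlx(f_b) - xlx(n - f_a) - xlx(n - f_b)
    )


# Hypothetical counts where f_AB equals the chance expectation f_A * f_B / N.
independent = log_likelihood(f_ab=5, f_a=100, f_b=50, n=1000)
# Hypothetical counts where the pair co-occurs far more often than chance.
associated = log_likelihood(f_ab=100, f_a=1000, f_b=500, n=1_000_000)
print(independent, associated)
```

When the observed co-occurrence count matches the chance expectation, the score is zero up to floating-point error; it grows as the observed count departs from that expectation.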

minimum sensitivity

\min\Big(\frac{f_{AB}}{f_B}, \frac{f_{AB}}{f_A}\Big)

Pedersen, Dependent Bigram Identification, in Proc. Fifteenth National Conference on Artificial Intelligence, 1998
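A minimal Python sketch of minimum sensitivity, with a hypothetical function name and counts:

```python
def minimum_sensitivity(f_ab: int, f_a: int, f_b: int) -> float:
    """min(f_AB / f_B, f_AB / f_A)."""
    return min(f_ab / f_b, f_ab / f_a)


# Hypothetical counts: f_A = 1000, f_B = 500, f_AB = 100.
print(minimum_sensitivity(f_ab=100, f_a=1000, f_b=500))
```

The two ratios here are 0.2 and 0.1, so the score is 0.1. The smaller ratio always comes from the more frequent of the two words, so the score is equivalent to f_AB / max(f_A, f_B).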

MI.log-f (formerly called salience)

\text{MI-Score} \cdot \ln(f_{AB} + 1)

Kilgarriff, A., Rychlý, P., Smrž, P., & Tugwell, D. (2004). ITRI-04-08 The Sketch Engine. Information Technology, 105, 116.
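The effect of the frequency weighting in MI.log-f can be seen in a short Python sketch. The function name and the counts are hypothetical; the MI part uses the log_2 formula given for MI Score in this section.

```python
import math


def mi_log_f(f_ab: int, f_a: int, f_b: int, n: int) -> float:
    """MI Score multiplied by ln(f_AB + 1)."""
    mi = math.log2(f_ab * n / (f_a * f_b))
    return mi * math.log(f_ab + 1)


N = 1_000_000  # hypothetical corpus size
# Two hypothetical pairs with the same MI Score, log2(200),
# but a tenfold difference in co-occurrence frequency.
rare = mi_log_f(f_ab=10, f_a=100, f_b=500, n=N)
frequent = mi_log_f(f_ab=100, f_a=1000, f_b=500, n=N)
print(rare, frequent)
```

Both pairs have an identical MI Score, but the more frequent pair receives the higher MI.log-f value because of the ln(f_AB + 1) factor.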


Dice

\frac{2 \cdot f_{AB}}{f_A + f_B}


logDice

14 + \log_2 \frac{2 \cdot f_{AB}}{f_A + f_B}

relative freq

\frac{f_{AB}}{f_A} \cdot 100
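The last three statistics in this section (the Dice coefficient, its 14 + log_2 variant, and relative frequency) can be sketched together in Python. The function names and counts are hypothetical.

```python
import math


def dice(f_ab: int, f_a: int, f_b: int) -> float:
    """2 * f_AB / (f_A + f_B)."""
    return 2 * f_ab / (f_a + f_b)


def log_dice(f_ab: int, f_a: int, f_b: int) -> float:
    """14 + log2 of the Dice coefficient."""
    return 14 + math.log2(dice(f_ab, f_a, f_b))


def relative_freq(f_ab: int, f_a: int) -> float:
    """Percentage of keyword occurrences that co-occur with the collocate."""
    return f_ab / f_a * 100


# Hypothetical counts: f_A = 1000, f_B = 500, f_AB = 100.
print(dice(100, 1000, 500), log_dice(100, 1000, 500), relative_freq(100, 1000))
# Upper bound: keyword and collocate always occur together.
print(log_dice(50, 50, 50))
```

None of the three formulas involves the corpus size N. The Dice coefficient is at most 1, reached when the two words always occur together, so the logarithmic variant is at most 14.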
