The word sketch processes the word’s collocates and other words in its surroundings. It can be used as a one-page summary of the word’s grammatical and collocational behaviour. The results are organized into categories, called grammatical relations, such as words that serve as an object of the verb, words that serve as a subject of the verb, words that modify the word, etc. The rules that decide which words are included in the analysis are written in the sketch grammar.

Large modern corpora can contain billions of words with thousands of instances of each word, which makes it unrealistic to read every concordance line. The word sketch solves this problem: it processes the words surrounding the search word and displays them in a compact, easy-to-understand format, organized from the most typical to the least typical collocations. The following word sketch was generated from over 9 million instances of the word team. It shows the search word, its frequency, its collocates sorted into grammatical relations, the frequency of each collocate, the typicality score, the most frequent representation of each collocation and a local menu with links to other tools.

Use the icons in the header of each column to reorder, close and redisplay the columns. Use the icons next to each collocate or in the column header to display the collocates in context as a concordance. To focus on a single column, use the icon that hides all the remaining columns with one click. Hover the mouse over an icon to display a tooltip explaining its function.

By default, the word sketch is sorted with the most typical collocations at the top. This is the preferred option for most uses because the most frequent collocates are often not the most interesting or useful ones, whereas the typical ones usually are. The logDice score determines how typical (or how strong) a collocation is. Use the view options to display the score.

A high score means that the collocate is often found together with the node and, at the same time, either combines with few other nodes or combines with them only infrequently. The bond between the node and the collocate is very strong ⇢ strong collocation.

A low score means that the collocate combines with a large number of other words. The bond between the node and the collocate is weak ⇢ weak collocation.

It is not possible to set a universal threshold between weak and strong collocations because each word behaves differently. The main purpose of the score is to sort the collocates by their typicality or strength, not to decide whether a collocation is weak or strong.

The view options allow sorting by frequency if needed.

Please refer to Statistics used in Sketch Engine and to Lexicographer-friendly score for the formula. Here is a simplified but sufficiently informative explanation.

Referring to the screenshot above, to determine the strength of the collocation management team compared to other collocations of team, all nouns modified by management are found first. The sketch grammar for English determines which nouns surrounding management should be regarded as the modified nouns.
Then, each time management modifying team is found, management gets a plus point. Each time management is found modifying another noun, management is given a minus point. The logDice score is calculated to indicate whether there were many plus points or many minus points. The score is always presented as a positive number.
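
The plus and minus points can be made concrete with the published logDice formula, 14 + log2(2 × f_xy / (f_x + f_y)), where f_xy counts the node and the collocate together, f_x counts the node and f_y counts the collocate. The following Python sketch is only an illustration: the function name and all counts are made up, and the exact way the three frequencies are counted inside a grammatical relation is described in the referenced documentation.

```python
import math

def log_dice(f_xy: int, f_x: int, f_y: int) -> float:
    """Return 14 + log2(2 * f_xy / (f_x + f_y))."""
    if f_xy <= 0 or f_x <= 0 or f_y <= 0:
        raise ValueError("all frequencies must be positive")
    return 14 + math.log2(2 * f_xy / (f_x + f_y))

# Hypothetical counts, not taken from any real corpus.
team_count = 9_000_000       # occurrences of the node "team"
management_count = 3_000_000 # occurrences of the collocate "management"
management_team = 60_000     # occurrences of "management" modifying "team"

score = log_dice(management_team, team_count, management_count)
print(round(score, 2))
```

Every occurrence of management with team raises the numerator (a plus point), while every occurrence of management with another noun only raises f_y in the denominator (a minus point). The theoretical maximum of 14 is reached when the two words never occur without each other.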

A very high score of the collocate means that there is little competition from other collocates. The node (the search word, the keyword) does not often combine with other collocates. In other words, the competitors are not frequent for either of these reasons or their combination:

  • The number of different competing collocates is very small.
  • The number of different competing collocates may be high but the frequency of each of them is low so the total stays low.

A very low score means that there is extreme competition from other collocates for either of these reasons or their combination:

  • The number of different competing collocates is very high.
  • The number of different competing collocates may be small but the frequency of each of them is extremely high which produces lots of competition for the collocate in question.

As a result, very common words (new, go, be, small, very) hardly ever receive high scores as collocates: they combine so often with so many other words that there is a lot of competition. Exceptions exist when a collocation is so extremely frequent that it beats all its competitors.
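
The effect of competition can be illustrated by ranking a few collocates of team in two ways. This is a minimal sketch under stated assumptions: every number is hypothetical, and the helper applies the logDice formula to whole-word frequencies rather than to frequencies within a grammatical relation.

```python
import math

NODE_FREQ = 9_000_000  # hypothetical frequency of the node "team"

# (collocate, co-occurrences with "team", total frequency of the collocate)
# All figures are invented for illustration.
collocates = [
    ("new", 120_000, 400_000_000),
    ("management", 60_000, 3_000_000),
    ("relay", 8_000, 150_000),
]

def log_dice(f_xy, f_x, f_y):
    return 14 + math.log2(2 * f_xy / (f_x + f_y))

# Sorting by raw co-occurrence frequency puts the very common word first.
by_frequency = sorted(collocates, key=lambda c: c[1], reverse=True)

# Sorting by logDice pushes it down, because its huge total frequency
# means it combines with a great many other words.
by_typicality = sorted(
    collocates,
    key=lambda c: log_dice(c[1], NODE_FREQ, c[2]),
    reverse=True,
)

print([c[0] for c in by_frequency])
print([c[0] for c in by_typicality])
```

With these figures, new is the most frequent collocate but ends up last when sorted by typicality.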

The sketch grammar is a set of rules written in CQL which use POS tags and regular expressions to define which tokens should be included in each grammatical relation. For example, a subject may be defined as a noun before a verb, but the actual rule is much more complex: it sets more specific requirements for both the noun and the verb, their relative position, and the compulsory and optional words between them.
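
The idea of a rule that matches POS tags with regular expressions can be sketched in Python. This is not CQL and not the actual Sketch Engine grammar: the tags only loosely follow Penn Treebank conventions, and the rule (a noun, optional adverbs, then a verb) is a deliberately simplified assumption.

```python
import re

# Tokens as (lemma, POS tag); the sentence and its tags are hypothetical.
sentence = [
    ("the", "DT"), ("team", "NN"), ("quickly", "RB"),
    ("win", "VBD"), ("the", "DT"), ("match", "NN"),
]

NOUN = re.compile(r"NN.*")
VERB = re.compile(r"VB.*")
ADVERB = re.compile(r"RB.*")

def subject_pairs(tokens):
    """Return (noun, verb) pairs where only adverbs separate noun and verb."""
    pairs = []
    for i, (lemma, tag) in enumerate(tokens):
        if not NOUN.fullmatch(tag):
            continue
        j = i + 1
        # Skip the optional adverbs between the noun and the verb.
        while j < len(tokens) and ADVERB.fullmatch(tokens[j][1]):
            j += 1
        if j < len(tokens) and VERB.fullmatch(tokens[j][1]):
            pairs.append((lemma, tokens[j][0]))
    return pairs

print(subject_pairs(sentence))  # [('team', 'win')]
```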

The word sketch does not use any parsing information and a parsed corpus is not needed. However, sketch grammars using parsed corpora can be developed.

Instead of using the sketch grammars developed by Sketch Engine, users can develop their own and apply them to their own corpora but not to the preloaded corpora.

POS Tags and lemmas

The word sketch works with a POS-tagged and lemmatized corpus. A parsed corpus is not needed. Universal word sketches are available for corpora without tagging and/or lemmatization, see below.

The corpus has to be tagged in Sketch Engine or with the same tagset as the one used by Sketch Engine so that the tags are the same as the ones used in the word sketch grammar. A custom word sketch grammar has to be used if the corpus is tagged with a different tagset.

A word sketch can also be generated from a non-lemmatized corpus, in which case each word form is treated independently. Thus, using English as an example, one word sketch would be produced for goes and a separate one for went. Such word sketches exist only for languages where lemmatization is not supported by Sketch Engine.
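
The difference between the two cases can be illustrated with a small Python sketch. The observations and the function name are hypothetical; the point is only that collocates are pooled under the lemma when lemmatization is available and split by word form when it is not.

```python
from collections import Counter

# (word form, lemma, neighbouring word); invented observations.
observations = [
    ("goes", "go", "home"),
    ("went", "go", "home"),
    ("went", "go", "abroad"),
    ("goes", "go", "abroad"),
    ("went", "go", "home"),
]

def collocate_table(rows, lemmatized):
    """Group neighbouring words by lemma or, without lemmas, by word form."""
    table = {}
    for form, lemma, neighbour in rows:
        key = lemma if lemmatized else form
        table.setdefault(key, Counter())[neighbour] += 1
    return table

print(collocate_table(observations, lemmatized=True))
print(collocate_table(observations, lemmatized=False))
```

In the lemmatized case all five observations feed a single entry for go; otherwise goes and went each get their own, smaller set of collocates.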

Corpus size

The corpus size itself does not affect the quality of the result. What matters is the absolute frequency of the word for which the word sketch is generated. A few dozen occurrences are the bare minimum, but a usable word sketch requires at least a few hundred. To obtain a rich word sketch with lots of collocates, at least a few thousand occurrences are needed. The quality improves with each order of magnitude.

A so-called universal sketch grammar is used for corpora in languages where tagging and/or lemmatization is not available. The grammatical relations will be simplified to something like noun to the right, noun to the left, verb to the right, verb to the left, or even word to the right, word to the left. Although simplistic, they are a great help when working with large corpora and high-frequency verbs because data coming from thousands of occurrences can be reviewed quickly and easily.
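
A relation such as word to the left or word to the right needs no tagging at all, as the following Python sketch shows. It is an assumption-laden illustration: the function name and the sample text are invented, and the real universal sketch grammar is written in CQL rather than Python.

```python
from collections import Counter

def universal_relations(tokens, keyword):
    """Count the words immediately to the left and right of the keyword."""
    relations = {
        "word to the left": Counter(),
        "word to the right": Counter(),
    }
    for i, token in enumerate(tokens):
        if token != keyword:
            continue
        if i > 0:
            relations["word to the left"][tokens[i - 1]] += 1
        if i + 1 < len(tokens):
            relations["word to the right"][tokens[i + 1]] += 1
    return relations

# Hypothetical untagged, unlemmatized text.
text = "the home team beat the away team and the home team celebrated".split()
result = universal_relations(text, "team")
print(result["word to the left"])
print(result["word to the right"])
```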

Multiword Sketch — collocations with phrases

The multiword sketch is an extension of the word sketch. It processes the left and right context of a phrase and identifies the collocates of each word in the phrase. The collocations are only extracted from sentences which contain the collocation (phrase) in question. In other words, the collocates only come from contexts where the collocation (phrase) is used. Contexts where the members of the phrase are used on their own are excluded. This makes it possible to only display collocates related to a particular word sense or subject.
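
The filtering step described above can be sketched in Python. This is a simplification under stated assumptions: sentences are given as lists of lemmas, all data are invented, and plain co-occurrence in the sentence stands in for the grammatical relations that the real multiword sketch uses.

```python
from collections import Counter

def multiword_collocates(sentences, phrase):
    """Count lemmas that occur in sentences containing every phrase member."""
    members = set(phrase)
    counts = Counter()
    for lemmas in sentences:
        # Skip contexts where the phrase members are used on their own.
        if not members.issubset(lemmas):
            continue
        counts.update(lemma for lemma in lemmas if lemma not in members)
    return counts

# Hypothetical lemmatized sentences.
sentences = [
    ["tax", "sugary", "drink"],
    ["sugary", "snack"],
    ["grab", "drink"],
    ["child", "consume", "sugary", "drink"],
]

print(multiword_collocates(sentences, ["sugary", "drink"]))
```

Here snack and grab are ignored because their sentences contain only one member of the phrase.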

The multiword sketch can be accessed in two ways:

From the context menu

Locate the collocate you are interested in and click multiword sketch in the context menu.


By typing the collocation in the input form

Type the node and the collocate(s) directly in the input form. Their order is not important. Type only words that would appear in the grammatical relations of the word sketch; do not type anything that is not the node or a collocate.


The input must only contain lemmas and only those that can be captured by grammatical relations, i.e. those that qualify as collocations. Typically, articles or other determiners and pronouns should not be included.

To get collocations for    Type
sweetened drinks           sweeten drink
grab a drink               grab drink
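
The conversion from a phrase to the expected input can be sketched as follows. The lemma lookup and the list of skipped words are toy, hypothetical data covering only the examples above; a real workflow would rely on the lemmas found in the corpus.

```python
# Toy lemma lookup and function-word list; both are hypothetical.
LEMMAS = {"sweetened": "sweeten", "drinks": "drink"}
SKIP = {"a", "an", "the", "this", "that", "my", "your", "it"}

def to_multiword_query(phrase):
    """Drop determiners and pronouns, then replace word forms with lemmas."""
    words = phrase.lower().split()
    return " ".join(LEMMAS.get(word, word) for word in words if word not in SKIP)

print(to_multiword_query("sweetened drinks"))  # sweeten drink
print(to_multiword_query("grab a drink"))      # grab drink
```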

Multiword sketch for sugary drink

Bilingual Sketch — compare collocations in two languages

The bilingual word sketch displays a word sketch for the word and another word sketch for its translation side by side. This makes it easy to observe the collocations in both languages. This is the English – German bilingual word sketch for dog (Hund) using the English Web 2015 and German Web 2013 corpora.

The corresponding grammatical relations are aligned for easy comparison. Relations are not aligned if:

  • The same (or comparable) grammatical relation is not included in the word sketch grammar for the language.
  • The linguistic feature does not exist in both languages.

Bilingual word sketch compares collocations in two languages


Only the monolingual statistics of each word sketch can be displayed. Use the view options to make them visible. No multilingual statistics are available.


This tool works with any corpora. A parallel corpus is not required.

How to use

  • Select the first corpus and go to Word Sketch.
  • Select the ADVANCED tab.
  • Tick Translate, select the corpus in the second language, type the lemma and, if needed, select the part of speech.
  • Click GO.