Word Sketch — collocations and word combinations

The word sketch processes the word’s collocates and other words in its surroundings. It can be used as a one-page summary of the word’s grammatical and collocational behaviour. The results are organized into categories called grammatical relations, such as words that serve as an object of the verb, words that serve as a subject of the verb, words that modify the word, and so on. The words included in the analysis are defined by rules written in the sketch grammar.

Screenshot: word sketch - collocations of team. The numbered controls in the screenshot are:

  1. change search criteria
  2. download results
  3. display or hide scores, frequency, change sorting, activate clustering
  4. visualisation – display collocations as diagram
  5. favourites – bookmark this word sketch for easy access
  6. move this column to a different position
  7. display all sentences or examples from which collocations in this column were collected
  8. only keep this column and hide the others
  9. hide this column
  10. change the part of speech
  11. use this word as the search word for other tools

How to use the word sketch

Visit the related Quick start guide or watch this video.

Hover the mouse over icons, controls and other elements to display the tooltips. Click the highlighted words to learn about the functions and settings.

Word sketch tooltip

What makes the word sketches unique?

Large modern corpora can contain billions of words, with thousands of instances of each word, which makes it unrealistic to read every concordance line. The word sketch solves this problem. Words surrounding the search word are processed by Sketch Engine and displayed in a compact, easy-to-understand format, ordered from the most typical to the least typical collocations. The following word sketch was generated from over 9 million instances of the word team. It shows the search word, its frequency, its collocates sorted into grammatical relations, the frequency of each collocate, the typicality score, the most frequent representation of each collocation and a local menu with links to other tools.

Display this word sketch in Sketch Engine (login required)
Display a similar word sketch in an open corpus (no login required)

word sketch - collocations of team

Working with the columns

Use the icons in the header of each column to reorder the columns, hide them and display them again. Use the icons next to each collocate or in the header of the column to display the collocates in context as a concordance. To focus on a single column, use the icon which hides all the remaining columns with one click. Hover the mouse over the icons to display tooltips which explain their functions.

Sorted by the score (typicality)

By default, the word sketch is sorted with the most typical collocations at the top. This is the preferred option for most uses because the most frequent collocates are usually not the most interesting or useful ones, whereas the most typical ones are. The logDice score is used to determine how typical (or how strong) the collocation is. Use view options to display the score.

A high score means that the collocate is often found together with the node (the search word) and, at the same time, the collocate combines with few other nodes or combines with them only infrequently. The bond between the node and the collocate is very strong ⇢ strong collocation.

A low score means that the collocate likes to combine with very many other words. The bond between the node and the collocate is weak ⇢ weak collocation.

It is not possible to set a universal threshold between weak and strong collocations because each word behaves differently. The main purpose of the score is to sort the collocates by their typicality or strength, not to decide whether a collocation is weak or strong.

The view options allow sorting by frequency if needed.

How is the score computed?

Please refer to Statistics used in Sketch Engine and to Lexicographer-friendly score for the formula. Here is a simplified but sufficiently informative explanation.

Referring to the screenshot above, to determine the strength of the collocation management team compared to other collocations of team, all nouns modified by management are found first. The sketch grammar for English determines which nouns surrounding management should be regarded as the modified nouns.
Then, each time management modifying team is found, management gets a plus point. Each time management is found modifying another noun, management is given a minus point. The logDice score is calculated to indicate whether there were many plus points or many minus points. The score is always presented as a positive number.
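The plus-point and minus-point description can be made concrete with a short sketch. It assumes the general form of the logDice score, 14 + log2(2 · f_xy / (f_x + f_y)), where f_xy is the number of co-occurrences, f_x the frequency of the node and f_y the frequency of the collocate. Exactly which counts Sketch Engine uses within a grammatical relation is documented in the pages referenced above; the counts and variable names below are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import math


def log_dice(f_xy: int, f_x: int, f_y: int) -> float:
    """General logDice form: 14 + log2(2 * f_xy / (f_x + f_y))."""
    if f_xy <= 0:
        raise ValueError("the pair must co-occur at least once")
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


# Hypothetical counts, not taken from any real corpus.
plus_points = 100    # management found modifying team
minus_points = 500   # management found modifying some other noun
team_total = 1000    # team found with any modifier

score = log_dice(
    f_xy=plus_points,
    f_x=team_total,
    f_y=plus_points + minus_points,
)
print(score)
```

With these counts the ratio inside the logarithm is 200 / 1600 = 0.125, so the score is 14 - 3 = 11. Adding more minus points increases the denominator and lowers the score.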

Interpreting the score

A very high score of the collocate means that there is little competition from other collocates. The node (the search word, the keyword) does not often combine with other collocates. In other words, the competitors are not frequent for either of these reasons or their combination:

  • The number of different competing collocates is very small.
  • The number of different competing collocates may be high but the frequency of each of them is low so the total stays low.

A very low score means that there is extreme competition from other collocates for either of these reasons or their combination:

  • The number of different competing collocates is very high.
  • The number of different competing collocates may be small but the frequency of each of them is extremely high which produces lots of competition for the collocate in question.

As a result, the most common and frequent words (new, go, be, small, very) hardly ever receive high scores as collocates because they combine with so many other words so often that there is a lot of competition. Exceptions can exist if the collocation is so extremely frequent that it beats all its competitors.
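The effect of competition can be illustrated by ranking two collocates of the same node in both ways. This is a minimal sketch with invented counts, again assuming the general logDice form 14 + log2(2 · f_xy / (f_x + f_y)); the names NODE_FREQ and candidates are hypothetical.

```python
import math


def log_dice(f_xy: int, f_x: int, f_y: int) -> float:
    """General logDice form: 14 + log2(2 * f_xy / (f_x + f_y))."""
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


NODE_FREQ = 1000  # hypothetical frequency of the node

# (collocate, co-occurrences with the node, total frequency of the collocate)
# All counts are invented for illustration.
candidates = [
    ("new", 300, 200_000),     # very common word, combines with almost anything
    ("management", 100, 600),  # rarer word, a large share of its uses are with the node
]

by_frequency = sorted(candidates, key=lambda c: c[1], reverse=True)
by_score = sorted(
    candidates,
    key=lambda c: log_dice(c[1], NODE_FREQ, c[2]),
    reverse=True,
)

print([c[0] for c in by_frequency])
print([c[0] for c in by_score])
```

Sorting by frequency puts new first, while sorting by score puts management first: new co-occurs with the node more often, but its very high overall frequency means heavy competition and a much lower score.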

Sketch grammar

The sketch grammar is a set of rules written in CQL which make use of POS tags and regular expressions to define which tokens should be included in each grammatical relation. For example, a subject may be defined as a noun before a verb, but the actual rule is much more complex: it defines more specific requirements for both the noun and the verb, their relative position, and the compulsory and optional words between them. It will also contain conditions to clean up content captured accidentally by the preceding rules.
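The idea behind such a rule can be imitated in a few lines of Python. This is only a toy illustration: it is not CQL and not part of any real sketch grammar. The Penn-Treebank-like tags, the sample tokens and the function name subject_pairs are all assumptions made for the example. The rule treats a noun followed by any number of adverbs and then a verb as a subject relation.

```python
import re
from collections import Counter

# (word form, lemma, POS tag); tags are Penn-Treebank-like and assumed here.
tagged = [
    ("The", "the", "DT"), ("team", "team", "NN"), ("quickly", "quickly", "RB"),
    ("wins", "win", "VBZ"), ("games", "game", "NNS"), (".", ".", "."),
    ("Teams", "team", "NNS"), ("win", "win", "VBP"), (".", ".", "."),
]

NOUN = re.compile(r"NN.*")
VERB = re.compile(r"VB.*")
ADVERB = re.compile(r"RB.*")


def subject_pairs(tokens):
    """Yield (noun lemma, verb lemma) for the pattern: noun, adverbs*, verb."""
    for i, (_, noun_lemma, tag) in enumerate(tokens):
        if not NOUN.fullmatch(tag):
            continue
        j = i + 1
        # Skip optional adverbs between the noun and the verb.
        while j < len(tokens) and ADVERB.fullmatch(tokens[j][2]):
            j += 1
        if j < len(tokens) and VERB.fullmatch(tokens[j][2]):
            yield noun_lemma, tokens[j][1]


subject_counts = Counter(subject_pairs(tagged))
print(subject_counts)
```

Because the pairs are collected by lemma, team wins and Teams win are counted as the same collocation. A real rule would add many more conditions, as described above.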

The word sketch does not use any parsing information and a parsed corpus is not needed. However, sketch grammars using parsed corpora can be developed.

Instead of using the sketch grammars developed by Sketch Engine, users can develop their own and apply them to their own corpora.

Requirements for the word sketch to work well

POS Tags and lemmas

The word sketch works with a POS-tagged and lemmatized corpus. A parsed corpus is not needed. Universal word sketches are available for corpora without tagging and/or lemmatization, see below.

The corpus has to be tagged in Sketch Engine or with the same tagset as the one used by Sketch Engine so that the tags are the same as the ones used in the word sketch grammar. A custom word sketch grammar has to be used if the corpus is tagged with a different tagset.

A word sketch can also be generated from a non-lemmatized corpus, in which case each word form is treated independently. Thus, using English as an example, separate word sketches would be produced for goes and for went. Such word sketches exist only for languages where lemmatization is not supported by Sketch Engine.

Corpus size

The corpus size itself does not affect the quality of the result; what matters is the absolute frequency of the word for which the word sketch is generated. A few dozen occurrences are the bare minimum, but a few hundred occurrences are needed for a usable word sketch. A rich word sketch with many collocates needs at least a few thousand occurrences. The quality improves with each order of magnitude.

Universal sketch grammar

A so-called universal sketch grammar is used for corpora in languages where tagging and/or lemmatization is not available. The grammatical relations will be simplified to something like noun to the right, noun to the left, verb to the right, verb to the left, or even word to the right, word to the left. Although simplistic, they are a great help when working with large corpora and high-frequency verbs because data coming from thousands of occurrences can be reviewed quickly and easily.
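A rough sketch of the simplest relations, word to the left and word to the right, is shown below. It assumes that only the immediately adjacent token counts and that the text is tokenized by whitespace. Both are simplifications made for illustration and not a description of how the universal sketch grammar is actually defined; the function name neighbours and the sample sentence are hypothetical.

```python
from collections import Counter


def neighbours(tokens, node):
    """Count the tokens immediately to the left and right of each node hit."""
    left, right = Counter(), Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        if i > 0:
            left[tokens[i - 1]] += 1
        if i + 1 < len(tokens):
            right[tokens[i + 1]] += 1
    return {"word to the left": left, "word to the right": right}


text = "the team won and the team lost but our team won"
sketch = neighbours(text.split(), "team")

for relation, counts in sketch.items():
    print(relation, counts.most_common())
```

No tags or lemmas are needed, which is why this kind of relation remains available for untagged corpora. The trade-off is that word forms are not grouped and the relations carry no grammatical meaning.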