Most frequent or most typical collocations – which is more useful?

Word sketches in Sketch Engine are one-page summaries of the word combinations (called collocations) that a word prefers. These summaries are computed automatically from a large sample of language, often billions of words in size, called a text corpus.

Looking at the page gives the user an instant idea of how the word is typically used, in which contexts it appears most frequently and what its typical word combinations are. The user can also go directly to the actual sentences from which the information was extracted to check the context in detail.

An example of a word sketch might look like this:

Screenshot of word sketch from the British National Corpus (BNC)

The combinations are divided into categories such as modifiers, verbs, objects or subjects of a verb etc.

Clearly, each word can form more word combinations than those displayed in a word sketch by default. So how does Sketch Engine determine which collocations should be displayed? Where is the cut-off line? Users generally assume that this happens on the basis of frequency and that the collocations at the top of the list are the most frequent ones. In most cases this would not be very useful, as we will see below. Sketch Engine takes a different approach and focuses on typicality (or strength of collocation) rather than frequency of use.

What is the difference between frequency and typicality?

Frequency (weak collocations)

Surprisingly, the fact that a word combination is frequent is often of limited use or even insignificant in terms of language teaching/learning or language research. For example, here are the most frequent collocations of the word bedroom (only adjectives modifying the noun are included):

small
own
spare
twin
front
main
comfortable
big
large

Looking at the list, one notices that most of the words are very predictable. In other words, if a student of English wants to speak about a bedroom of a small size, they will naturally use the word small. They will not usually need to consult a dictionary to make sure that small is a suitable word combination. Similarly, when teaching bedroom as a new word, it is not useful to point the student to collocations such as small, own, big or comfortable because they are quite predictable. The collocations in this list would be classified as weak collocations.

Typicality (strong collocations)

Typicality, on the other hand, picks out the collocations that are useful for learning or teaching or for inclusion in a dictionary. It focuses on collocations which are not (completely) predictable. An example of such a collocation from the list above is twin bedroom. A collocation list for bedroom ordered by the typicality score looks quite different, with these items at the top:

master
double
spacious
spare
en-suite
upstairs
twin
guest
air-conditioned

This list is more useful for language learning and more interesting for linguists and lexicographers. It all depends on the language level, of course, and the first list might actually be of some use to beginners, but it is the second list that we would expect to see when we want to learn how the word bedroom is used in English.

How does the software do it?

There is a very complex and sophisticated algorithm behind word sketches that identifies collocations and calculates the collocation score (logDice) to decide whether the collocation will be included in the word sketch. To get a rough understanding of how these collocations are identified, you can imagine the process as follows:

First, the algorithm identifies all instances of adjective + bedroom combinations in the corpus. Then it takes each adjective in turn, for example small, and looks for all small + noun combinations in the corpus. Each time small is found together with bedroom, it gets a plus point, and each time small is found in combination with another noun, it gets a minus point. (The actual algorithm is more complex, but this simplification is sufficiently illustrative.) As a result, the algorithm will classify collocations like this:

  • adjectives that tend to combine with a large selection of other words, i.e. are very flexible in their use, will come out as weak collocations and will generally not be included in the word sketch
  • adjectives that combine with only one or a handful of nouns (they ‘specialize’ in combining with certain nouns) will come out as strong collocations and will be included in the word sketch
  • even collocations composed of frequent words, such as small print, will be included because the noun print does not combine with many other adjectives, so small does not face much competition
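
The plus and minus points described above can be made concrete with the basic form of the logDice formula, 14 + log2(2 × f(x, y) / (f(x) + f(y))), where f(x, y) counts the pair and f(x) and f(y) count each word on its own. The following Python sketch covers that formula only and is not Sketch Engine's actual implementation: the function name log_dice and every count in it are invented for illustration.

```python
import math


def log_dice(pair_freq: int, collocate_freq: int, headword_freq: int) -> float:
    """Basic logDice score: 14 + log2(2 * f(x, y) / (f(x) + f(y)))."""
    if pair_freq <= 0:
        raise ValueError("pair_freq must be positive")
    return 14 + math.log2(2 * pair_freq / (collocate_freq + headword_freq))


# Invented counts for illustration only; they are not taken from any corpus.
HEADWORD_FREQ = 5_000  # all adjective + bedroom combinations

toy_counts = {
    # adjective: (times seen with bedroom, times seen with any noun)
    "small": (400, 50_000),
    "twin": (150, 1_000),
}

scores = {
    adjective: log_dice(with_bedroom, with_any_noun, HEADWORD_FREQ)
    for adjective, (with_bedroom, with_any_noun) in toy_counts.items()
}

# Rank the adjectives from strongest to weakest collocation.
for adjective, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{adjective}: {score:.2f}")
```

With these made-up numbers, twin scores higher than small even though small bedroom is the more frequent pair, because small is spread across far more nouns.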

By default, the collocates in a word sketch are sorted by score and the top 25 items are displayed. The user can change this limit and can also switch to sorting by frequency, which puts the less typical items (in language teaching terminology, the less advanced ones) at the top.
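
The effect of the two sort orders and the cut-off can be illustrated with a small Python sketch. It is only a model of the behaviour described here, not Sketch Engine code: the Collocate type, the top_collocates function and all frequencies and scores are hypothetical values chosen for the example.

```python
from typing import NamedTuple


class Collocate(NamedTuple):
    word: str
    frequency: int
    score: float


def top_collocates(items, limit=25, sort_by="score"):
    """Return the first `limit` collocates, ordered by score or by frequency."""
    if sort_by not in ("score", "frequency"):
        raise ValueError("sort_by must be 'score' or 'frequency'")
    ranked = sorted(items, key=lambda c: getattr(c, sort_by), reverse=True)
    return ranked[:limit]


# Invented frequencies and scores for adjectives modifying "bedroom".
toy_collocates = [
    Collocate("small", 400, 7.9),
    Collocate("own", 350, 6.5),
    Collocate("spare", 200, 9.8),
    Collocate("double", 180, 10.1),
    Collocate("twin", 150, 9.7),
    Collocate("master", 120, 10.4),
]

by_score = top_collocates(toy_collocates, limit=3)
by_frequency = top_collocates(toy_collocates, limit=3, sort_by="frequency")

print([c.word for c in by_score])
print([c.word for c in by_frequency])
```

Sorting by score brings the less predictable items such as master and double to the top, while sorting by frequency brings back predictable ones such as small and own.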

How to analyse collocations in the British National Corpus (BNC)

Learn to work with collocations in Sketch Engine in 4 minutes.
