Distributional thesaurus

For details about how the distributional thesaurus works, please refer to this blog post: Automatic thesaurus

Why use the distributional thesaurus?

The distributional thesaurus has clear advantages over a printed thesaurus:

  • can be generated for any word and is not limited by space
  • can produce an unlimited number of synonyms
  • gives direct access to a comparison of the words
  • gives direct access to examples of use
  • can be generated from a specific corpus to give results specific to that corpus
  • can highlight seemingly unrelated words with similar collocational behaviour


The quality of the automatically generated thesaurus is heavily dependent on the size of the corpus. Small corpora will produce lower quality results.

Distributional thesaurus generated from a corpus and showing synonyms in a word cloud

How to generate a thesaurus?

Log in to Sketch Engine (or click Home in the left menu) and select a corpus.

Corpus selection

  • type a lemma
  • select the part of speech (leave it set to auto to have Sketch Engine select the most frequent part of speech)
  • click Show similar words

Automatic thesaurus - basic settings

The header shows the lemma, corpus name and the frequency of the lemma in the corpus. Clicking the frequency will bring up the concordance.

The list of synonyms is a list of lemmas ordered by Score, a measure of similarity; Freq is the frequency of the lemma. The lemmas are clickable and will bring up a Word Sketch Difference.

The word cloud is clickable. Click any word to bring up a Word Sketch Difference.


A distributional thesaurus is an automatically produced thesaurus which finds words that tend to occur in contexts similar to those of the target word. It is not a manually constructed thesaurus of synonyms.
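The idea of "words that occur in similar contexts" can be illustrated with a minimal sketch. It is a toy illustration only, not the statistic Sketch Engine uses: it assumes a context is simply the set of words within two tokens of the target and compares words by Jaccard overlap. The function names and the three-sentence corpus are invented for the example.

```python
from collections import defaultdict

def context_sets(sentences, window=2):
    """Map each word to the set of words seen within `window` tokens of it."""
    contexts = defaultdict(set)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            contexts[word].update(tokens[lo:i] + tokens[i + 1:hi])
    return contexts

def similar_words(target, contexts):
    """Rank other words by Jaccard overlap of their contexts with the target's."""
    target_ctx = contexts[target]
    scored = []
    for word, ctx in contexts.items():
        if word == target:
            continue
        union = target_ctx | ctx
        score = len(target_ctx & ctx) / len(union) if union else 0.0
        scored.append((word, score))
    # Highest similarity first, like the Score column in the result list.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy corpus: "tea" and "coffee" share all their neighbours.
toy_corpus = [
    "she drinks hot tea every morning",
    "she drinks hot coffee every morning",
    "he parked the car outside",
]
ranking = similar_words("tea", context_sets(toy_corpus))
```

In this toy corpus "coffee" ranks first for "tea" because the two words have identical neighbours, while "car" shares none. With so little data the rest of the ranking is noise, which is the same reason small corpora give lower quality results.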

The left menu of the word sketch result screen gives these options:

saves the word sketch as txt or XML file

Change options
opens the advanced setting dialogue (described on this page) to change options

will cluster collocates by meaning, collocates similar in meaning will be grouped together

Sort by freq/score
will toggle the way the collocates are sorted: by frequency or by the strength of the collocation (score)

Hide/Show gramrels
show gramrels: collocates are categorized into groups
hide gramrels: collocates are displayed as one long list with grammatical relation, frequency and score listed

More data
will load more collocates, the columns will contain more items

Less data
will load fewer collocates, the columns will contain fewer items

Basic options are sufficient for most uses; however, the user can set these advanced options:

  • maximum number of items
    how many synonyms should be displayed
  • minimum score
    sets the minimum similarity a word must reach to be listed; a higher number produces fewer results that are more similar to the lemma
  • cluster items
    when ticked, synonyms are clustered (grouped) based on their similarity in meaning
  • minimum similarity
    when cluster items is ticked, this number sets how similar words need to be in order to be grouped; a higher number produces smaller groups of words which are closer in meaning
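The effect of these four options can be sketched in a few lines. This is an illustration of the behaviour described above, not Sketch Engine's implementation: the greedy grouping strategy, the function names, the default values and all scores are assumptions made up for the example.

```python
def filter_synonyms(candidates, max_items=10, min_score=0.0):
    """Keep candidates scoring at least `min_score`, best first, capped at `max_items`."""
    kept = [c for c in candidates if c[1] >= min_score]
    kept.sort(key=lambda c: c[1], reverse=True)
    return kept[:max_items]

def cluster_items(words, pair_similarity, min_similarity):
    """Greedy grouping: a word joins the first cluster whose head is similar enough."""
    clusters = []
    for word in words:
        for cluster in clusters:
            head = cluster[0]
            if pair_similarity.get(frozenset((head, word)), 0.0) >= min_similarity:
                cluster.append(word)
                break
        else:
            # No existing cluster was close enough, so start a new one.
            clusters.append([word])
    return clusters

# Hypothetical (lemma, score) pairs for the lemma "tea".
candidates = [("coffee", 0.42), ("beer", 0.31), ("juice", 0.28), ("table", 0.05)]
# Hypothetical similarities between the synonyms themselves.
pair_similarity = {
    frozenset(("coffee", "juice")): 0.35,
    frozenset(("coffee", "beer")): 0.20,
    frozenset(("beer", "juice")): 0.25,
}

shown = filter_synonyms(candidates, max_items=3, min_score=0.1)
words = [lemma for lemma, _ in shown]
loose = cluster_items(words, pair_similarity, min_similarity=0.2)
tight = cluster_items(words, pair_similarity, min_similarity=0.4)
```

With these made-up numbers, raising the minimum score to 0.1 drops "table". A minimum similarity of 0.2 puts all three remaining words in one group, while 0.4 leaves each word in a group of its own.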

The statistics used in Sketch Engine to calculate the thesaurus are described in this document:


An efficient algorithm for building a distributional thesaurus (and other Sketch Engine developments). In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions. Czech Republic, June 2007, pp. 41–44.