Thesaurus — synonyms, antonyms and similar words
The thesaurus in Sketch Engine is an automatically generated list of synonyms or words belonging to the same category (semantic field). The list is produced based on the context in which the words appear in the selected text corpus. Only nouns, adjectives, verbs and adverbs are supported in most corpora.
change search criteria
display or hide scores, frequency, activate clustering
visualisation – display collocations as diagram
favourites – bookmark this word sketch for easy access
change the part of speech
use this word as the search word for other tools
How to use the thesaurus
Visit the related Quick start guide or watch this video.
Hover the mouse over icons, controls and other elements to display the tooltips. Click the highlighted words to learn about the functions and settings.
What makes the thesaurus unique?
Because no manual work is involved, the synonym lists can be generated for any word in the language provided a sufficient number of occurrences is found in the corpus. This is why synonym lists can be generated even for rare words which would not be included in traditional thesauri.
How are synonyms identified?
Synonyms are identified automatically based on the context in which they occur. This draws on the theory of distributional semantics which says, in a nutshell, that words that appear in the same contexts are similar in meaning. In Sketch Engine, this means that words which have similar collocations are treated as similar in meaning. The word sketch is key in determining the similarity. To determine synonyms of the search word, the word sketches of all words with the same part of speech are compared, and those that share the largest proportion of collocates are listed as similar words. The score given for each synonym indicates the percentage of shared collocates.
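The comparison described above can be illustrated with a small sketch. It is not Sketch Engine's actual implementation: the toy word sketches, the relation names and the function names are all invented for illustration, and the choice of denominator (the number of collocates of the search word) is an assumption. The real formula is given in Statistics used in Sketch Engine.

```python
from typing import Dict, List, Set, Tuple

# A toy "word sketch": a set of (grammatical relation, collocate) pairs.
# Real word sketches also carry frequencies and association scores.
Sketch = Set[Tuple[str, str]]


def shared_percentage(target: Sketch, other: Sketch) -> float:
    """Percentage of the target's collocates that also occur with `other`.

    Assumption: the denominator is the size of the target's sketch.
    """
    if not target:
        return 0.0
    return 100.0 * len(target & other) / len(target)


def similar_words(word: str, sketches: Dict[str, Sketch],
                  top_n: int = 5) -> List[Tuple[str, float]]:
    """Rank all other words by the share of collocates they have in common."""
    target = sketches[word]
    scored = [
        (other, shared_percentage(target, sketch))
        for other, sketch in sketches.items()
        if other != word
    ]
    # Highest score first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_n]


# Hypothetical sketches for three nouns (same part of speech).
sketches = {
    "tea": {("object_of", "drink"), ("object_of", "pour"),
            ("modifier", "hot"), ("modifier", "green")},
    "coffee": {("object_of", "drink"), ("object_of", "pour"),
               ("modifier", "hot"), ("modifier", "black")},
    "car": {("object_of", "drive"), ("modifier", "fast"),
            ("modifier", "green")},
}

ranking = similar_words("tea", sketches)
print(ranking)  # [('coffee', 75.0), ('car', 25.0)]
```

In this toy data, "car" still receives a non-zero score because it shares the modifier "green" with "tea", which shows how unrelated words can enter the list when only collocates are compared.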
The quality of the thesaurus depends heavily on rich word sketches containing many collocates, which in turn depend on a high frequency of both the search word and the potential synonyms. This means that a very large corpus is needed. A size of around 100,000 words is the bare minimum to produce some usable results for high-frequency words. However, a much larger corpus is needed for rare words to ensure sufficient frequency. The use of our multi-billion word corpora is highly recommended.
The synonym list may contain words which should not be included. This is a result of automatic processing. Sketch Engine cannot determine the similarity in meaning directly; it can only compare the collocates. If two words share the same collocates, they will be listed as synonyms even though their meanings are not similar. Such occasional inaccuracies do not make the tool less useful. To reduce them, use a larger corpus. For extremely rare words (a frequency of just a few hundred occurrences or fewer), the thesaurus will inevitably produce poor results or may not be produced at all.
How is the score calculated and interpreted?
The score, which can be displayed using view options, is simply a percentage of the shared collocates. To establish this, the word sketch of the search word is compared to the word sketches of all other words in the corpus with the same part of speech. Each grammatical relation is compared separately. Please refer to Statistics used in Sketch Engine for the formula and details.
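As a rough illustration of comparing each grammatical relation separately, the sketch below counts shared collocates relation by relation and then combines the counts into one percentage. The relation names, the example data and the way the per-relation counts are combined (a plain sum divided by the search word's collocate count) are assumptions made for this example only; the formula actually used is the one documented in Statistics used in Sketch Engine.

```python
from typing import Dict, Set

# Toy word sketch grouped by grammatical relation:
# relation name -> set of collocates found in that relation.
RelationSketch = Dict[str, Set[str]]


def shared_by_relation(target: RelationSketch,
                       other: RelationSketch) -> Dict[str, int]:
    """Count shared collocates separately for each relation of the target.

    A collocate only counts as shared if it occurs in the SAME relation
    in both sketches.
    """
    return {
        relation: len(collocates & other.get(relation, set()))
        for relation, collocates in target.items()
    }


def overall_percentage(target: RelationSketch, other: RelationSketch) -> float:
    """Combine per-relation counts into one percentage (assumed: simple sum)."""
    total = sum(len(collocates) for collocates in target.values())
    if total == 0:
        return 0.0
    shared = sum(shared_by_relation(target, other).values())
    return 100.0 * shared / total


# Hypothetical sketches for two adjectives.
big = {"modifies": {"house", "problem", "city"}, "and_or": {"small", "heavy"}}
large = {"modifies": {"house", "city", "number"}, "and_or": {"small"}}

per_relation = shared_by_relation(big, large)
score = overall_percentage(big, large)
print(per_relation)  # {'modifies': 2, 'and_or': 1}
print(score)         # 60.0
```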
Requirements for the thesaurus to work well
The thesaurus can only work if word sketches exist in the corpus. The corpus has to be tagged in Sketch Engine or with the same tagset. A custom word sketch grammar has to be used if the corpus is tagged with a different tagset.
The thesaurus will work even with universal sketch grammars, albeit with all the related limitations. See word sketch.
Tags and lemmas
A tagged and lemmatized corpus is required for a full-fledged thesaurus. Thesauri generated from untagged and non-lemmatized corpora with universal word sketches will suffer in quality. Yet they can be very useful, especially with less-resourced languages where tagging and lemmatization are not realistic.
The quality of the thesaurus is entirely dependent on rich word sketches. A large number of collocates needs to be found for the search word but also for all other words with the same part of speech so that they can be compared. By a rich word sketch we mean a large number of collocations in all grammatical relations. This requirement can only be met if the word has a high frequency in the corpus, ideally thousands of occurrences or more. Consequently, a very large corpus is needed so that even less frequent words can produce rich word sketches.