Clustering | Sketch Engine

Clustering can be performed in Sketch Engine on:

the similar words in a Thesaurus
the collocates in a Word sketch

If the clustering option is selected, the similar words from the thesaurus are clustered according to their distributional similarity scores. The distributional similarity score is defined in section 3 of our documentation, Statistics used in the Sketch Engine. The algorithm is greedy and agglomerative. All pairs of words are listed in order of their distributional similarity, and the sorted list is processed in decreasing order. A word is merged into an already formed cluster provided that the distributional similarity between it and any word in that cluster is greater than the specified threshold similarity, and that this value is higher than the equivalent value for every other cluster formed so far.
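The description above leaves some details open, so the following Python sketch shows one possible reading of it, not Sketch Engine's actual implementation. It assumes that the input is a list of (word, word, score) triples, that a pair of two unclustered words starts a new cluster, and that a pair whose words are both already clustered is skipped. The function name cluster_similar_words and the sample scores are hypothetical.

```python
from typing import Dict, List, Tuple


def cluster_similar_words(
    pairs: List[Tuple[str, str, float]], threshold: float
) -> List[List[str]]:
    """Greedy agglomerative clustering over scored word pairs (illustrative only)."""
    # List all pairs in decreasing order of distributional similarity.
    ordered = sorted(pairs, key=lambda p: p[2], reverse=True)

    cluster_of: Dict[str, int] = {}   # word -> index of its cluster
    clusters: List[List[str]] = []

    for a, b, score in ordered:
        # Pairs at or below the threshold can never trigger a merge.
        if score <= threshold:
            break
        a_done, b_done = a in cluster_of, b in cluster_of
        if not a_done and not b_done:
            # Assumption: two unclustered words seed a new cluster.
            cluster_of[a] = cluster_of[b] = len(clusters)
            clusters.append([a, b])
        elif a_done and not b_done:
            # b joins a's cluster; this is b's strongest remaining link.
            cluster_of[b] = cluster_of[a]
            clusters[cluster_of[a]].append(b)
        elif b_done and not a_done:
            cluster_of[a] = cluster_of[b]
            clusters[cluster_of[b]].append(a)
        # Assumption: if both words are already clustered, nothing is merged.
    return clusters


# Hypothetical similarity scores, for illustration only.
sample = [
    ("cat", "dog", 0.9),
    ("dog", "puppy", 0.8),
    ("car", "truck", 0.7),
    ("cat", "car", 0.2),
]
print(cluster_similar_words(sample, threshold=0.3))
# [['cat', 'dog', 'puppy'], ['car', 'truck']]
```

Because the pairs are processed in decreasing order, a word that is not yet clustered always meets its most similar partner first, so in this sketch it joins the cluster to which it has the highest similarity without an explicit comparison across clusters.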

The collocates within a word sketch are clustered according to any such clusters from the distributional thesaurus that they appear in.
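As a rough illustration of this step, the Python sketch below groups a list of collocates by the thesaurus cluster each one belongs to. It assumes that the thesaurus clusters are already available as lists of words and that a collocate found in no cluster is left on its own. The function name group_collocates and the sample data are hypothetical.

```python
from typing import Dict, Hashable, List


def group_collocates(
    collocates: List[str], thesaurus_clusters: List[List[str]]
) -> List[List[str]]:
    """Group collocates that appear in the same thesaurus cluster (illustrative only)."""
    # Map each thesaurus word to the index of the cluster it appears in.
    cluster_of = {
        word: index
        for index, cluster in enumerate(thesaurus_clusters)
        for word in cluster
    }

    groups: Dict[Hashable, List[str]] = {}
    for collocate in collocates:
        # Assumption: a collocate outside every cluster forms its own group.
        key = cluster_of.get(collocate, ("unclustered", collocate))
        groups.setdefault(key, []).append(collocate)

    # Dicts preserve insertion order, so groups keep the collocate order.
    return list(groups.values())


# Hypothetical thesaurus clusters and word sketch collocates.
clusters = [["cat", "dog", "puppy"], ["car", "truck"]]
print(group_collocates(["dog", "bone", "cat", "truck"], clusters))
# [['dog', 'cat'], ['bone'], ['truck']]
```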

Related paper

The third section in Statistics used in the Sketch Engine
