Sketch Engine can compare corpora in the same language by comparing an attribute (usually word forms or lemmas) across them. It computes a score indicating the extent to which the corpora are similar or different. A score of 1 indicates identical corpora; the higher the score, the more the corpora differ.

Users can compare preloaded corpora as well as their own user corpora.

How to compare corpora

(1) go to SELECT CORPUS – Advanced tab – COMPARE CORPORA

(2) select the language and set the attribute
  • only corpora in the same language can be compared
  • the attribute defines what will be compared, e.g. lemma compares words regardless of the word form in which they appear

(3) select from the preloaded, user, or shared corpora to compare

(4) the result will be displayed in a comparison chart

Understanding the result

  • the value of 1 indicates identical corpora
  • the higher the score, the greater the difference between corpora
  • the scores are clickable and link to the word list page for the two selected corpora and the chosen attribute

Corpus comparison alternative

Two corpora can also be compared based on keywords and terms extracted from them. Set one corpus as the focus corpus and the other as the reference corpus.

A comparison of selected English corpora in Sketch Engine

The corpus comparison result shows that all English web corpora in Sketch Engine have very similar content, while the DOAJ corpus is notably different. The EUROPARL corpus of speeches in the European Parliament is very different from the DOAJ corpus.


Process of comparing corpora

For every pair of selected corpora, the compare corpora method carries out the following process (shown here for the attribute ‘word’):

  1. Finds the 5000 most frequent words in corpus1 and the 5000 most frequent words in corpus2.
  2. Merges them into one set of words, removing duplicates.
  3. Computes the keyness score for every word in the set. For each word, the corpus with the higher relative frequency is treated as the focus corpus and the corpus with the lower relative frequency as the reference corpus. The resulting score is therefore always greater than 1, or exactly 1 if the relative frequencies are the same.
  4. Keeps only the top 500 words by keyness score.
  5. Computes the arithmetic mean (average) of their scores. This average is the number displayed in the result chart; it expresses how similar the corpora are.
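
The steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not Sketch Engine's actual implementation. The function names, the tokenised-list input, and the keyness formula are assumptions: the text above does not say how keyness is computed, so the sketch assumes a "simple maths" style ratio of relative frequencies per million tokens with a smoothing constant of 1.

```python
from collections import Counter


def keyness(fpm_a, fpm_b, smoothing=1.0):
    """Assumed keyness formula: smoothed ratio of relative frequencies.

    The corpus with the higher relative frequency acts as the focus
    corpus, so the result is never below 1.
    """
    focus, reference = max(fpm_a, fpm_b), min(fpm_a, fpm_b)
    return (focus + smoothing) / (reference + smoothing)


def compare_corpora(tokens1, tokens2, top_frequent=5000, top_keywords=500):
    """Return the average keyness of the most distinctive words (hypothetical helper)."""
    if not tokens1 or not tokens2:
        raise ValueError("both corpora must contain at least one token")

    counts1, counts2 = Counter(tokens1), Counter(tokens2)
    total1, total2 = len(tokens1), len(tokens2)

    # Steps 1 and 2: most frequent words of each corpus, merged into one set.
    vocabulary = {w for w, _ in counts1.most_common(top_frequent)}
    vocabulary |= {w for w, _ in counts2.most_common(top_frequent)}

    # Step 3: keyness per word, based on frequency per million tokens.
    scores = [
        keyness(counts1[w] * 1_000_000 / total1, counts2[w] * 1_000_000 / total2)
        for w in vocabulary
    ]

    # Step 4: keep only the highest-scoring words.
    top_scores = sorted(scores, reverse=True)[:top_keywords]

    # Step 5: arithmetic mean of the retained scores.
    return sum(top_scores) / len(top_scores)
```

Under these assumptions, comparing a corpus with itself yields exactly 1, and any difference in relative frequencies pushes the score above 1, which matches the interpretation given in "Understanding the result".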


arithmetic mean (average) – the sum of a collection of numbers divided by the count of numbers in the collection

Kilgarriff, A. (2001). Comparing corpora. International Journal of Corpus Linguistics, 6(1), 97-133.