Sketch Engine can compare corpora in the same language by comparing an attribute (usually word forms or lemmas) across the corpora. A score is computed indicating the extent to which the corpora are similar or different. A score of 1 indicates identical corpora. The higher the score, the more different the corpora are.

Users can compare preloaded corpora as well as their own user corpora.

How to compare corpora

(1) go to SELECT CORPUS – Advanced tab – COMPARE CORPORA

(2) select the language and set the attribute
  • only corpora in the same language can be compared
  • the attribute defines what will be compared, e.g. lemma compares words regardless of the word form they appear in

(3) select from the preloaded, user, or shared corpora to compare

(4) the result will be displayed in a comparison chart

Understanding the result

  • the value of 1 indicates identical corpora
  • the higher the score, the greater the difference between the corpora
  • the scores are clickable and link to the word list page for the two selected corpora and the chosen attribute

Corpus comparison alternative

Two corpora can also be compared based on keywords and terms extracted from them. Set one corpus as the focus corpus and the other as the reference corpus.
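
As a rough illustration of the keyword-based alternative, the sketch below ranks the words of a focus corpus against a reference corpus. It is not Sketch Engine's code: the function name `keywords`, the tokenised-list input and the score used here (frequency per million plus a smoothing constant, divided by the same quantity in the reference corpus) are assumptions made for the example.

```python
from collections import Counter


def keywords(focus_tokens, reference_tokens, top=10, smoothing=1.0):
    """Rank focus-corpus words by an assumed ratio-based keyword score."""
    if not focus_tokens or not reference_tokens:
        raise ValueError("both corpora must contain at least one token")

    focus = Counter(focus_tokens)
    reference = Counter(reference_tokens)
    n_focus, n_reference = len(focus_tokens), len(reference_tokens)

    scored = []
    for word, count in focus.items():
        # Normalise to frequency per million so corpus size does not matter.
        fpm_focus = count * 1_000_000 / n_focus
        fpm_reference = reference[word] * 1_000_000 / n_reference
        # Assumed score: a smoothed ratio of the two normalised frequencies.
        score = (fpm_focus + smoothing) / (fpm_reference + smoothing)
        scored.append((word, score))

    # Highest score first; ties are broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top]


focus = "parliament votes on the budget the parliament".split()
reference = "the cat sat on the mat".split()
for word, score in keywords(focus, reference, top=3):
    print(word, round(score, 2))
```

Words typical of the focus corpus score well above 1, while words that are equally or more frequent in the reference corpus score around or below 1.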

A comparison of selected English corpora in Sketch Engine

The corpus comparison result shows that all English web corpora in Sketch Engine have very similar content, while the DOAJ corpus is notably different. The EUROPARL corpus of speeches in the European Parliament is very different from the DOAJ corpus.

Compare corpora

Process of comparing corpora

The compare corpora method carries out the following process for every two selected corpora:

  1. Finds the top 5000 words by frequency (in each corpus separately).
  2. Computes a keyword score (for each corpus separately) for every word in the unification (union) of the two top-5000 lists.
  3. Keeps only the 500 words with the highest scores (only the highest values count, whether the score is positive or negative).
  4. Takes the arithmetic mean (average) of their scores. This average is the number that expresses the similarity of the pair of corpora and is displayed in the result chart.
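
The four steps above could be sketched roughly as follows. This is a minimal illustration, not Sketch Engine's implementation: the function names, the tokenised-list input, the keyword score (a ratio of smoothed frequencies per million) and the choice to treat a word that is twice as frequent in either corpus as equally distinctive are all assumptions. They were chosen so that identical corpora give a score of exactly 1, as described earlier.

```python
from collections import Counter


def top_words(counts, n):
    """Step 1: the n most frequent words of one corpus."""
    return {word for word, _ in counts.most_common(n)}


def compare_corpora(tokens_a, tokens_b, top_n=5000, keep=500, smoothing=1.0):
    """Return an assumed similarity score for two tokenised corpora."""
    if not tokens_a or not tokens_b:
        raise ValueError("both corpora must contain at least one token")

    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    size_a, size_b = len(tokens_a), len(tokens_b)

    # Step 2: score every word in the union of the two top lists.
    vocabulary = top_words(counts_a, top_n) | top_words(counts_b, top_n)
    scores = []
    for word in vocabulary:
        fpm_a = counts_a[word] * 1_000_000 / size_a + smoothing
        fpm_b = counts_b[word] * 1_000_000 / size_b + smoothing
        # Assumption: larger over smaller, so the direction does not matter.
        scores.append(max(fpm_a, fpm_b) / min(fpm_a, fpm_b))

    # Step 3: keep only the highest-scoring words.
    scores.sort(reverse=True)
    kept = scores[:keep]

    # Step 4: the arithmetic mean of the kept scores is the result.
    return sum(kept) / len(kept)


corpus_a = "the cat sat on the mat".split()
corpus_b = "the dog ran in the park".split()
print(compare_corpora(corpus_a, corpus_a))  # identical corpora
print(compare_corpora(corpus_a, corpus_b))  # different corpora
```

With this scoring, comparing a corpus with itself yields 1.0, and the value grows as the vocabularies and their relative frequencies diverge.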

Glossary

unification – the union of two or more sets, e.g. the union of two sets A and B is the set of elements which are in A, in B, or in both A and B

arithmetic mean (average) – the sum of a collection of numbers divided by the count of numbers in the collection

Kilgarriff, A. (2001). Comparing corpora. International Journal of Corpus Linguistics, 6(1), 97-133.