Use Sketch Engine in minutes

Generating collocations, frequency lists, examples in context and n‑grams, or extracting terms, is easy with Sketch Engine. Use our Quick Start Guide to learn it in minutes.

Sketch Engine is also suitable for comparing corpora. Users can compare preloaded corpora as well as corpora they have compiled themselves. The result of the comparison is displayed as a comparison chart.

How to compare corpora

Compare corpora in four simple steps.

(1) click Compare corpora in the Main menu

(2) select language and set attribute for the comparison

(3) tick two or more corpora to compare

(4) see the result in the comparison chart

Characteristics of the result

  • a value of “1” means the corpora are identical
  • the higher the score and the darker the color, the greater the difference between the corpora (the scale is not proportional: “4” does not mean twice as different as “2”)
  • the scores are clickable and link to the word list page for the two selected corpora and the chosen attribute

How does it work?

Process of comparing corpora

– for every pair of corpora:
– the top 5000 words by frequency are taken from each corpus separately,
– a keyword score is computed (for each corpus separately) for every word in the union (unification) of the two top-5000 lists,
– only the 500 words with the most extreme scores are kept (the highest values, whether positive or negative),
– the arithmetic mean (average) of these scores is the similarity score of the pair of corpora
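
The process above can be sketched in a few lines of Python. This is only an illustration, not Sketch Engine's actual implementation: the function names are hypothetical, and the keyword score is assumed to be a smoothed ratio of per-million frequencies, folded so that it is always at least 1 (which makes two identical corpora score exactly 1). The page itself does not specify the formula.

```python
from collections import Counter


def top_words(freqs, n):
    """Return the n most frequent words of one corpus as a set."""
    return {word for word, _ in freqs.most_common(n)}


def similarity(freqs_a, freqs_b, top_n=5000, keep=500, smoothing=1.0):
    """Illustrative similarity score for a pair of corpora.

    freqs_a and freqs_b are Counters mapping word -> raw frequency.
    A result of 1.0 means identical; higher means more different.
    """
    size_a = sum(freqs_a.values())
    size_b = sum(freqs_b.values())
    if size_a == 0 or size_b == 0:
        raise ValueError("both corpora must contain at least one token")

    # Union of the two top-N lists, each taken from its own corpus.
    candidates = top_words(freqs_a, top_n) | top_words(freqs_b, top_n)

    scores = []
    for word in candidates:
        # ASSUMPTION: keyword score = smoothed ratio of per-million frequencies.
        fpm_a = freqs_a[word] * 1_000_000 / size_a
        fpm_b = freqs_b[word] * 1_000_000 / size_b
        ratio = (fpm_a + smoothing) / (fpm_b + smoothing)
        # Fold both directions together so the most extreme words rank highest.
        scores.append(max(ratio, 1 / ratio))

    # Keep only the most extreme scores and average them.
    strongest = sorted(scores, reverse=True)[:keep]
    return sum(strongest) / len(strongest)


corpus_a = Counter("the cat sat on the mat".split())
corpus_b = Counter("the dog sat on the log".split())
print(similarity(corpus_a, corpus_a))  # identical corpora
print(similarity(corpus_a, corpus_b))  # different corpora score higher
```

Running the same function over every pair of selected corpora would fill in one cell of the comparison chart per pair.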

Glossary

unification (union) – the set of all elements that belong to at least one of two or more sets, e.g. the union of two sets A and B is the set of elements which are in A, in B, or in both A and B.

arithmetic mean (average) – the sum of a collection of numbers divided by the count of numbers in the collection
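
Both glossary terms can be illustrated with a short Python snippet. The word sets and scores below are made-up values chosen only for the example.

```python
from statistics import mean

# Two hypothetical top-word lists, one per corpus.
words_a = {"the", "of", "corpus"}
words_b = {"the", "and", "corpus"}

# Union: every word that appears in A, in B, or in both.
union = words_a | words_b

# Arithmetic mean: the sum of the values divided by how many there are.
scores = [1.0, 2.5, 4.0]
average = mean(scores)

print(sorted(union))
print(average)
```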

Another way to compare two corpora

The second way of comparing corpora is via the Word list feature, which makes it possible to compare two corpora (or their subcorpora) and to set how much weight is given to rare or common words.

A comparison chart for English corpora

The picture shows a comparison of various English corpora. The scores in the table indicate corpus similarity: 1 stands for identical corpora, and the higher the score (and the darker the grey), the greater the difference between the two corpora. The rows list the compared corpora and the columns list the reference corpora.

Bibliography

Kilgarriff, A. (2001). Comparing corpora. International Journal of Corpus Linguistics, 6(1), 97-133.