Sketch Engine can compare corpora in the same language by comparing an attribute (usually word forms or lemmas) across the corpora. It computes a score indicating to what extent the corpora are similar or different. A score of 1 indicates identical corpora. The higher the score, the more different the corpora are.
How to compare corpora
(2) select the language and set the attribute
only corpora in the same language can be compared
the attribute defines how to compare, e.g. lemma will ignore the different word forms of the same word
(3) select from the preloaded, user, or shared corpora to compare
(4) the result will be displayed in a comparison chart
Understanding the result
- The value of 1 indicates identical corpora.
- The higher the score, the greater the difference between corpora.
- It is not possible to tell which value indicates a small difference and which value indicates a big difference. The score can only be used to compare one difference with another.
- The score does not give clues to what exactly is different between the corpora. Since the comparison is done on tokens, the score is not affected by sentence length, number of documents, corpus size or grammatical features.
- The scores are clickable and connected to the relevant word lists of the two selected corpora.
Corpus comparison alternative
Two corpora can also be compared based on keywords and terms extracted from them. Set one corpus as the focus corpus and the other as the reference corpus.
A comparison of selected English corpora in Sketch Engine
The corpus comparison result shows that all English web corpora in Sketch Engine have very similar content, while the DOAJ corpus is notably different. The EUROPARL corpus of speeches in the European Parliament is very different from the DOAJ corpus.
How does comparing work?
The following process is used to compare each pair of corpora in the selection. Corpora can be compared using any attribute. This example uses the attribute word.
- Sketch Engine identifies the 5,000 most frequent words in corpus1 and the 5,000 most frequent words in corpus2.
- The two lists are combined into one and duplicates are removed so that each word only appears once.
- The keyness score for every word is computed. The corpus with the higher relative frequency of the word is set as the focus corpus. The corpus with the lower relative frequency of the word is used as the reference corpus. Thus, the resulting score is always greater than 1, or exactly 1 if the relative frequency is the same in both corpora.
- The 500 words with the highest keyness score are identified.
- The arithmetic mean (average) is calculated from the keyness scores of the top 500 words. The result expresses the similarity of the corpora. This is the number displayed in the chart on the corpus comparison result screen.
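The steps above can be sketched in Python as follows. This is only an illustration under stated assumptions, not Sketch Engine's actual code: the function name `compare_corpora` and its parameters are hypothetical, each corpus is represented as a plain list of tokens, and the keyness score is assumed to be a smoothed ratio of frequencies per million, (focus + n) / (reference + n), with a smoothing value n = 1 chosen here for illustration because the text above does not state the formula.

```python
from collections import Counter


def compare_corpora(tokens_a, tokens_b, top_n=5000, top_k=500, smoothing=1.0):
    """Return a similarity score for two corpora given as lists of tokens.

    Hypothetical sketch: the keyness formula and the smoothing value are
    assumptions, not taken from the description above.
    """
    if not tokens_a or not tokens_b:
        raise ValueError("both corpora must contain at least one token")

    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    size_a, size_b = len(tokens_a), len(tokens_b)

    # Steps 1-2: take the top_n most frequent items of each corpus and
    # merge them into one set, so each item appears only once.
    candidates = {w for w, _ in counts_a.most_common(top_n)}
    candidates |= {w for w, _ in counts_b.most_common(top_n)}

    # Step 3: keyness per item. The corpus with the higher relative
    # frequency acts as the focus corpus, so the ratio is never below 1.
    scores = []
    for word in candidates:
        fpm_a = counts_a[word] / size_a * 1_000_000  # frequency per million
        fpm_b = counts_b[word] / size_b * 1_000_000
        focus, reference = max(fpm_a, fpm_b), min(fpm_a, fpm_b)
        scores.append((focus + smoothing) / (reference + smoothing))

    # Steps 4-5: arithmetic mean of the top_k highest keyness scores.
    top = sorted(scores, reverse=True)[:top_k]
    return sum(top) / len(top)


if __name__ == "__main__":
    corpus1 = "the cat sat on the mat".split()
    corpus2 = "the dog sat on the log".split()
    print(compare_corpora(corpus1, corpus1))  # identical corpora
    print(compare_corpora(corpus1, corpus2))  # differing corpora
```

Because the focus corpus is chosen separately for each word, the result does not depend on the order in which the two corpora are passed. Toy corpora like the ones in the usage example produce very large scores, since a word missing from one corpus has a relative frequency of zero there.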
arithmetic mean (average) – the sum of a collection of numbers divided by the count of numbers in the collection
Kilgarriff, A. (2001). Comparing corpora. International journal of corpus linguistics, 6(1), 97-133.