How does comparing work?
The following process is used to compare each pair of corpora in the selection. Corpora can be compared using any attribute; this example uses the attribute word.
- Sketch Engine identifies the 5,000 most frequent words in corpus1 and the 5,000 most frequent words in corpus2.
- The two lists are combined into one and duplicates are removed so that each word only appears once.
- A keyness score is computed for every word. The corpus in which the word has the higher relative frequency is treated as the focus corpus, and the corpus in which it has the lower relative frequency is treated as the reference corpus. Consequently, the score is always greater than 1, or exactly 1 when the relative frequencies are equal.
- The 500 words with the highest keyness score are identified.
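- The arithmetic mean (average) of the keyness scores of the top 500 words is calculated. The result expresses the similarity of the corpora: because every keyness score is at least 1, the mean is also at least 1, and values closer to 1 indicate more similar corpora. This is the number displayed in the chart on the corpus comparison result screen.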
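The steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not Sketch Engine's actual implementation. The text does not state the keyness formula, so the sketch assumes a smoothed ratio of per-million frequencies. The function names, the `smoothing` parameter and its default value of 1 are all hypothetical, and each corpus is assumed to be a plain list of word tokens.

```python
from collections import Counter


def relative_freq(counts, word, corpus_size):
    """Frequency of `word` per million tokens (0 if the word is absent)."""
    return counts.get(word, 0) * 1_000_000 / corpus_size


def keyness(fpm_a, fpm_b, smoothing=1.0):
    """ASSUMED formula: smoothed ratio of the two relative frequencies.

    The corpus with the higher relative frequency acts as the focus
    corpus, so the result is never below 1.
    """
    focus, reference = max(fpm_a, fpm_b), min(fpm_a, fpm_b)
    return (focus + smoothing) / (reference + smoothing)


def compare_corpora(tokens1, tokens2, top_n=5000, top_k=500, smoothing=1.0):
    """Return the mean keyness of the `top_k` most distinctive words."""
    if not tokens1 or not tokens2:
        raise ValueError("both corpora must contain at least one token")

    counts1, counts2 = Counter(tokens1), Counter(tokens2)
    size1, size2 = len(tokens1), len(tokens2)

    # Take the most frequent words of each corpus and merge them;
    # a set keeps each word only once.
    candidates = {w for w, _ in counts1.most_common(top_n)}
    candidates |= {w for w, _ in counts2.most_common(top_n)}

    # Compute a keyness score for every candidate word.
    scores = [
        keyness(
            relative_freq(counts1, w, size1),
            relative_freq(counts2, w, size2),
            smoothing,
        )
        for w in candidates
    ]

    # Keep the highest scores and average them.
    top_scores = sorted(scores, reverse=True)[:top_k]
    return sum(top_scores) / len(top_scores)
```

Under these assumptions, comparing a corpus with itself returns exactly 1.0, and the result does not depend on the order in which the two corpora are passed, because the focus and reference roles are assigned per word.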
arithmetic mean (average) – the sum of a collection of numbers divided by the count of numbers in the collection
Kilgarriff, A. (2001). Comparing corpora. International Journal of Corpus Linguistics, 6(1), 97–133.