Process of comparing corpora
– take every pair of corpora,
– from each corpus separately, take the top 5000 words by frequency,
– for every word in the union of the two top-5000 lists, compute the keyword score (for each corpus separately),
– then keep only the top 500 words by score (only the most extreme values, whether the score is positive or negative),
– the arithmetic mean (average) of these scores is the similarity score of the pair of corpora.
union (unification) – the set of all elements that belong to at least one of two or more sets, e.g. the union of 2 sets A and B is the set of elements which are in A, in B, or in both A and B.
arithmetic mean (average) – the sum of a collection of numbers divided by the count of numbers in the collection
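
The steps above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the actual implementation: the notes do not define the keyword score, so `log_ratio_score` (a log ratio of smoothed per-million frequencies) is a hypothetical stand-in that can be swapped for any other scoring function. Ranking the 500 words by absolute value and averaging the absolute values is one reading of "positive or negative scores". The function names and the `smoothing` parameter are likewise invented for the example.

```python
import math
from collections import Counter
from itertools import combinations


def top_words(freqs, n=5000):
    """Return the set of the n most frequent words of one corpus."""
    return {word for word, _ in Counter(freqs).most_common(n)}


def log_ratio_score(count_a, total_a, count_b, total_b, smoothing=1.0):
    """Hypothetical keyword score (an assumption, not taken from the notes):
    log ratio of smoothed frequencies per million words.
    Positive: more typical of corpus A. Negative: more typical of corpus B."""
    fpm_a = count_a * 1_000_000 / total_a
    fpm_b = count_b * 1_000_000 / total_b
    return math.log((fpm_a + smoothing) / (fpm_b + smoothing))


def compare_pair(freqs_a, freqs_b, top_freq=5000, top_score=500,
                 score=log_ratio_score):
    """Similarity score of two corpora given as word -> count mappings."""
    # Union of the two top-frequency word lists.
    vocabulary = top_words(freqs_a, top_freq) | top_words(freqs_b, top_freq)
    if not vocabulary:
        return 0.0
    total_a = sum(freqs_a.values())
    total_b = sum(freqs_b.values())
    scores = [
        score(freqs_a.get(w, 0), total_a, freqs_b.get(w, 0), total_b)
        for w in vocabulary
    ]
    # Keep the most extreme scores, whatever their sign (assumed reading).
    strongest = sorted(scores, key=abs, reverse=True)[:top_score]
    # Arithmetic mean of their magnitudes (assumed reading).
    return sum(abs(s) for s in strongest) / len(strongest)


def compare_all(corpora, **kwargs):
    """Compare every pair of corpora; corpora maps name -> word counts."""
    return {
        (a, b): compare_pair(corpora[a], corpora[b], **kwargs)
        for a, b in combinations(sorted(corpora), 2)
    }
```

With this particular stand-in score, two identical corpora give 0.0 and the value grows as the word frequencies diverge, so a lower number means more similar corpora. A call such as `compare_all({"x": counts_x, "y": counts_y})` returns one value per pair, keyed by the two corpus names.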