Why does the frequency per million differ between a query restricted to specific text types and the same query run in a subcorpus made of those same text types?

The reason is that the frequency is normalized differently in the two cases. When a query is restricted by text types, the number of hits is divided by the size of the whole corpus. When the same query is run in a subcorpus, the number of hits is divided by the size of that subcorpus only.

For example, take the British National Corpus, which contains 112,181,015 tokens.

  • Search for [word="rainbow"] in the CQL form, with text types restricted to publication date 1960–1974. The frequency per million is 0.05. This is the number of occurrences multiplied by one million and divided by the size of the whole corpus (6 * 1,000,000 / 112,181,015).

  • Run the same query in the CQL form, but with the subcorpus 1960–1974 selected; its size is 2,074,244 tokens. The frequency must now be different, because the corpus size used in the calculation (here, the subcorpus) is more than 50 times smaller. The frequency per million is therefore 2.89 (6 * 1,000,000 / 2,074,244).
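
The two calculations can be reproduced with a short Python sketch. The function name `frequency_per_million` and the constant names are illustrative assumptions, not part of any corpus tool's API; the figures are the ones quoted in the example.

```python
def frequency_per_million(hits: int, size_in_tokens: int) -> float:
    """Normalize a raw hit count to occurrences per million tokens.

    Hypothetical helper for illustration only.
    """
    if size_in_tokens <= 0:
        raise ValueError("size_in_tokens must be positive")
    return hits * 1_000_000 / size_in_tokens


# Figures taken from the example in the text.
HITS = 6                        # occurrences of "rainbow" in 1960-1974 texts
WHOLE_CORPUS_TOKENS = 112_181_015
SUBCORPUS_TOKENS = 2_074_244

# Text-type restriction: hits are divided by the whole corpus size.
per_million_whole = frequency_per_million(HITS, WHOLE_CORPUS_TOKENS)

# Subcorpus query: hits are divided by the subcorpus size only.
per_million_sub = frequency_per_million(HITS, SUBCORPUS_TOKENS)

print(f"restricted by text type: {per_million_whole:.2f}")  # 0.05
print(f"within the subcorpus:    {per_million_sub:.2f}")    # 2.89
```

The number of hits is the same in both cases; only the denominator changes, which is why the two figures differ by the ratio of the two sizes (about 54).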