deduplication

Deduplication is the process of removing duplicated content from a corpus. Only the first instance of a text is preserved; all subsequent (duplicate) occurrences are removed.

Deduplication is especially important for corpora built by crawling the web, because a great deal of web content is reposted and shared in other locations. Including the same content multiple times would skew the statistics on real-life language use. In real life, the content was written only once, so it should be counted (and included in the corpus) only once.

Deduplication can be carried out at different levels. In Sketch Engine, it is typically carried out at the paragraph level: if the same paragraph is found elsewhere in the corpus, the second and all subsequent occurrences are removed. For example, two versions of a news article published on two websites belonging to the same company may share certain paragraphs. Deduplication will remove the shared paragraphs from one of the articles, leaving that article incomplete. This is accepted in the interest of preserving accurate frequency information.
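
The paragraph-level behaviour described above can be illustrated with a short Python sketch. This is not Sketch Engine's implementation; it only shows the general idea of keeping the first occurrence of each paragraph and dropping later ones. The function names, the whitespace normalization and the use of SHA-1 hashes are all assumptions made for the example.

```python
import hashlib


def paragraph_key(paragraph: str) -> str:
    """Return a hash of the paragraph with runs of whitespace collapsed.

    Collapsing whitespace is an assumption of this sketch, so that
    line-wrapping differences do not hide an otherwise identical paragraph.
    """
    normalized = " ".join(paragraph.split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def deduplicate_paragraphs(documents: list[list[str]]) -> list[list[str]]:
    """Keep only the first occurrence of each paragraph across all documents.

    `documents` is a list of documents, each given as a list of paragraphs.
    Documents are processed in order, so earlier documents keep shared text.
    """
    seen: set[str] = set()
    result: list[list[str]] = []
    for paragraphs in documents:
        kept = []
        for paragraph in paragraphs:
            key = paragraph_key(paragraph)
            if key in seen:
                continue  # second or later occurrence: drop it
            seen.add(key)
            kept.append(paragraph)
        result.append(kept)
    return result


article_a = ["Shared opening paragraph.", "Text found only in article A."]
article_b = ["Shared opening paragraph.", "Text found only in article B."]
deduplicated = deduplicate_paragraphs([article_a, article_b])
```

Here `deduplicated[0]` still contains both paragraphs of article A, while `deduplicated[1]` contains only the paragraph unique to article B. The second article is left incomplete, as in the news site example.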

Deduplication in Sketch Engine is designed to remove identical content as well as near-identical content, that is, content that differs only minimally.
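
One common way to detect almost identical content is to compare overlapping word sequences (shingles) and measure how much two paragraphs have in common. The Python sketch below illustrates that general technique only; the source does not specify which method Sketch Engine uses, and the function names, the shingle size of 3 and the threshold of 0.8 are hypothetical values chosen for the example.

```python
def shingles(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Return the set of overlapping n-word sequences in the text."""
    words = text.lower().split()
    if len(words) < n:
        # Text shorter than one shingle: treat the whole text as one unit.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0  # two empty texts are treated as identical
    return len(a & b) / len(a | b)


def is_near_duplicate(first: str, second: str, threshold: float = 0.8) -> bool:
    """Report whether two paragraphs share at least `threshold` of their shingles."""
    return jaccard(shingles(first), shingles(second)) >= threshold


original = "The quick brown fox jumps over the lazy dog near the river bank today"
reposted = "The quick brown fox jumps over the lazy dog near the river bank yesterday"
unrelated = "Corpus linguistics studies language through large collections of text"

near = is_near_duplicate(original, reposted)
far = is_near_duplicate(original, unrelated)
```

With these inputs, `near` is true because the two paragraphs differ in a single word, and `far` is false because the paragraphs share no three-word sequence. A lower threshold would treat more loosely related paragraphs as duplicates.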

Users can turn off deduplication for their own corpora if it is important to preserve duplicated content.

See also

Build a corpus from the web (preloaded corpora)
Build your own corpus from the web (user corpus)
Build corpus by uploading data (user corpus)