-
deduplication
Deduplication is the process of removing duplicated content from a corpus. Only the first instance of a text is preserved; any subsequent (duplicated) occurrences are removed. Deduplication is especially important for corpora built by crawling the web, because a lot of web content is reposted and shared to other locations. Including the same content multiple times would skew the statistics of real-life language use: the content was written only once, so it should be counted (and included in the corpus) only once.

Deduplication can be carried out at different levels. In Sketch Engine, it is typically carried out at the paragraph level: if the same paragraph is found elsewhere in the corpus, the second and subsequent occurrences are removed. For example, a news article published on two websites belonging to the same company may share certain paragraphs. Deduplication will remove the shared paragraphs from one of the articles, making that article incomplete. This is done in the interest of preserving true frequency-of-use information.

Deduplication in Sketch Engine is designed to remove identical content as well as content that is almost identical despite some minimal differences. Users can turn off deduplication for their own user corpora if it is important that duplicated content be preserved.

See also:
Build a corpus from the web (preloaded corpora)
Build your own corpus from the web (user corpus)
Build corpus by uploading data (user corpus)
-
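To illustrate the deduplication entry above, here is a minimal Python sketch of paragraph-level deduplication that keeps only the first occurrence of each paragraph across a corpus. It is not Sketch Engine's implementation: the function names are hypothetical, and the whitespace and case normalization is only a simple stand-in for "almost identical" matching, which in practice requires a proper near-duplicate detection method.

```python
import re


def normalize(paragraph: str) -> str:
    """Collapse whitespace and lowercase the text (an assumed, simplistic
    stand-in for near-duplicate matching)."""
    return re.sub(r"\s+", " ", paragraph).strip().lower()


def deduplicate_paragraphs(documents):
    """Return the documents with every repeated paragraph removed.

    `documents` is a list of documents, each a list of paragraph strings.
    The first occurrence of a paragraph is kept; later ones are dropped,
    even if that leaves a later document incomplete.
    """
    seen = set()
    result = []
    for paragraphs in documents:
        kept = []
        for paragraph in paragraphs:
            key = normalize(paragraph)
            if key in seen:
                continue  # second or later occurrence: remove it
            seen.add(key)
            kept.append(paragraph)
        result.append(kept)
    return result


# Two articles sharing one paragraph (the copy differs only in spacing).
docs = [
    ["Storm hits the coast.", "Officials urge caution."],
    ["Storm hits  the coast.", "Local shops remain closed."],
]
deduped = deduplicate_paragraphs(docs)
print(deduped)
```

In this sketch the second article loses its shared opening paragraph, which mirrors the news-site example: the article becomes incomplete, but each paragraph is counted only once.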
disambiguation
The process of identifying which meaning (lemma, part of speech) a word has in a given context when the word has multiple possible meanings. The result of this process is that each word is assigned exactly one meaning. -
distributional thesaurus [ feature ]
An automatically produced thesaurus which identifies words that occur in contexts similar to those of the target word. It draws on the hypothesis of distributional semantics. The automatically produced thesaurus is available for each word in the corpus. See more about the automatic thesaurus. The distributional thesaurus in Sketch Engine is available for every language and corpus that supports word sketches. Refer to the user manual to learn how to generate the thesaurus. -
document
A document (called a file in old corpora) is a generic name used in Sketch Engine to refer to any file, document or web page the corpus is made up of. If a user uploads files (such as .doc, .pdf, .txt), each file becomes a corpus document. If the user downloads content from the web, each web page becomes a corpus document. The beginning and end of each document are automatically marked with a structure, most typically <doc></doc>, but certain corpora may use a different convention, such as the British National Corpus, which uses <bncdoc></bncdoc>. This can be checked on the corpus info page. A corpus can also be divided into documents by manually inserting document structures into the source text. See Corpus annotation. -
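As a rough illustration of the document entry above, the following Python sketch wraps each source text in an opening and closing document structure before the text is uploaded. The function name is hypothetical, the texts are assumed to contain no markup of their own, and any further requirements on the source format (attributes, escaping, encoding) are not covered here.

```python
def mark_documents(texts, tag="doc"):
    """Wrap each text in <tag> ... </tag> so that every text forms one
    corpus document. `tag` defaults to the most typical structure name."""
    parts = []
    for text in texts:
        parts.append(f"<{tag}>")
        parts.append(text.strip())
        parts.append(f"</{tag}>")
    return "\n".join(parts)


source = mark_documents(["First article text.", "Second article text."])
print(source)
```

Passing a different `tag`, such as `bncdoc`, yields the alternative convention mentioned above.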
document frequency (docf) [ statistics ]
The document frequency is the number of documents in which the word or phrase appears. For example, suppose the corpus has 100 documents and only 2 of them contain the word city: document number 7 contains 17 instances of city and document number 31 contains 6 instances. The document frequency of city is 2, because 2 documents contain the word. It is not important how many documents the corpus contains or how many times the word appears in total.

The document frequency can be better suited for comparison when the corpus contains a small number of documents with an extremely high frequency of particular words.

Relative document frequency (also relative DOCF) is the percentage of documents that contain the word or item. Similar to the relative frequency, it is used to compare document frequencies between corpora of different sizes.

See also:
frequency
frequency per million
ARF
Statistics used in Sketch Engine
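The city example from the document frequency entry can be reproduced with a short Python sketch. The function names and the representation of a corpus as a list of token lists are assumptions made for illustration; they are not a Sketch Engine API.

```python
def document_frequency(documents, word):
    """Number of documents that contain `word` at least once."""
    return sum(1 for tokens in documents if word in tokens)


def relative_document_frequency(documents, word):
    """Percentage of documents that contain `word`."""
    if not documents:
        return 0.0
    return 100.0 * document_frequency(documents, word) / len(documents)


# A corpus of 100 documents; only documents number 7 and 31 contain "city".
documents = [["filler"] for _ in range(100)]
documents[6] = ["city"] * 17   # document number 7: 17 instances
documents[30] = ["city"] * 6   # document number 31: 6 instances

docf = document_frequency(documents, "city")
rel_docf = relative_document_frequency(documents, "city")
total = sum(tokens.count("city") for tokens in documents)

print(docf)      # 2 documents contain the word
print(rel_docf)  # 2.0 percent of the documents
print(total)     # 23 instances in total, which does not affect docf
```

The total of 23 instances is concentrated in two documents, which is the kind of case where the document frequency gives a more balanced picture than the raw frequency.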