  • deduplication

    Deduplication is the process of removing duplicated content from a corpus. Only the first instance of a text is preserved; any subsequent (duplicate) occurrences are removed.
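
    The keep-the-first-instance rule can be sketched in a few lines of Python. This is only an illustration, not Sketch Engine's actual implementation: the function name deduplicate, the choice of whole paragraphs as the unit of comparison, and the case and whitespace normalisation are all assumptions made for the example.

```python
def deduplicate(paragraphs):
    """Keep the first occurrence of each paragraph and drop later duplicates."""
    seen = set()
    unique = []
    for text in paragraphs:
        # Assumption: two paragraphs count as duplicates when they match
        # after collapsing whitespace and ignoring case.
        key = " ".join(text.split()).lower()
        if key not in seen:
            seen.add(key)
            unique.append(text)  # the first instance is preserved as written
    return unique


paragraphs = [
    "The city is large.",
    "It has many parks.",
    "The  city is LARGE.",  # duplicate of the first paragraph
]
cleaned = deduplicate(paragraphs)
```

    Here cleaned holds the first two paragraphs in their original order; the third is dropped because it repeats the first.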
  • disambiguation

    The process of identifying the intended meaning of a word (its lemma and part of speech) when the word has more than one possible meaning. The result is that each word is assigned exactly one meaning.
  • distributional thesaurus [ feature ]

    An automatically produced thesaurus which identifies words that occur in contexts similar to those of the target word. It draws on the theory of distributional semantics.
  • document

    A document (called a file in old corpora) in Sketch Engine refers to any file, document or web page the corpus is made up of. If a user uploads files (such as .doc, .pdf, .txt), each file becomes a corpus document. If the user downloads content from the web, each web page becomes a corpus document.
  • document frequency (docf) [ statistics ]

    The document frequency is the number of documents in which a token or phrase appears. For example, suppose a corpus has 100 documents and only 2 of them contain the word city: document number 7 contains 17 instances and document number 31 contains 6 instances. The document frequency of city is 2, because 2 documents contain the word, regardless of how many times it occurs in each.
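
    The difference between document frequency and plain frequency can be checked with a short Python sketch that rebuilds the city example. The helper names document_frequency and frequency, and the representation of each document as a list of tokens, are assumptions made for illustration; they are not part of Sketch Engine.

```python
def document_frequency(documents, word):
    """Count the documents that contain the word at least once."""
    return sum(1 for tokens in documents if word in tokens)


def frequency(documents, word):
    """Count every occurrence of the word across all documents."""
    return sum(tokens.count(word) for tokens in documents)


# A toy corpus of 100 documents; by default none of them mentions "city".
documents = [["town"] for _ in range(100)]
documents[6] = ["city"] * 17   # document number 7: 17 instances
documents[30] = ["city"] * 6   # document number 31: 6 instances

docf = document_frequency(documents, "city")
freq = frequency(documents, "city")
```

    With this corpus docf is 2 while freq is 23 (17 + 6), which shows that document frequency counts documents and not occurrences.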