• web mining [ text-analysis ]

    Web mining is an application of data mining that extracts information from texts. It focuses on gaining information and metadata from the web. For this task, Sketch Engine uses WebBootCaT, a fully automated tool for creating corpora from the web, which also stores metadata about the processed websites. Read about other text analysis tools.
  • Word

    Note: This entry is for the type of token. For the positional attribute, see word form. A word is a type of token. All tokens in a corpus are divided into two groups: words and nonwords. Words are tokens which begin with a letter of the alphabet. Tokens such as book, working, Mary, T-shirt, post-1945, mp3 or CO2 are words because they start with a letter. The regular expression Sketch Engine uses to identify words is [[:alpha:]].* Compare to nonword.
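
    The rule above can be sketched in a few lines of Python. This is only an illustration, not Sketch Engine code: Python's re module does not support the POSIX class [[:alpha:]], so the sketch assumes that str.isalpha() on the first character is a close enough stand-in. The function name is_word and the extra tokens 1945, "," and 3D are hypothetical additions for the demonstration.

```python
def is_word(token: str) -> bool:
    """Return True if the token begins with a letter (approximates [[:alpha:]].*)."""
    # Assumption: str.isalpha() stands in for the POSIX [[:alpha:]] class,
    # which Python's re module does not provide.
    return bool(token) and token[0].isalpha()


# Tokens from the entry plus three illustrative tokens that start with a non-letter.
tokens = ["book", "working", "Mary", "T-shirt", "post-1945", "mp3", "CO2", "1945", ",", "3D"]

words = [t for t in tokens if is_word(t)]
nonwords = [t for t in tokens if not is_word(t)]

print("words:", words)
print("nonwords:", nonwords)
```

    Only the first character decides the group, so post-1945 and mp3 count as words even though they contain digits.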
  • word form [ attribute ]

    This entry is for the positional attribute: word form, lemma, lowercase, tag… For the type of token, the opposite of nonword, see word. The word form (often shortened to word in the interface) is a positional attribute. It refers to one of the word forms that a lemma can take, e.g. the lemma go can take these word forms: go, went, gone, goes, going. (more…)
  • word list

    A word list is a generic name for various types of lists, such as lists of words, lemmas, POS tags or other attributes, together with their frequency (hit counts, document counts or others).
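
    As a rough illustration of the difference between hit counts and document counts, the following Python sketch builds a small word list from a made-up two-document corpus. The data and variable names are hypothetical and do not reflect how Sketch Engine computes its lists.

```python
from collections import Counter

# Hypothetical toy corpus: each inner list is one tokenised document.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat", "the"],
]

# Hit count: total number of occurrences across the whole corpus.
hit_counts = Counter(tok for doc in docs for tok in doc)

# Document count: number of documents containing the item at least once.
doc_counts = Counter(tok for doc in docs for tok in set(doc))

# A word list sorted by descending hit count, then alphabetically.
word_list = sorted(hit_counts.items(), key=lambda kv: (-kv[1], kv[0]))

for item, hits in word_list:
    print(item, hits, doc_counts[item])
```

    The same counting works for lemmas, POS tags or any other attribute once each token is replaced by the value of that attribute.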
  • word sketch

    The word sketch is a tool to display collocations (=word combinations) in a compact, easy-to-understand way. The word sketch makes it easy to understand how a word behaves, which contexts it typically appears in and which words it can be used together with. (more…)
  • Word Sketch grammar

    Word Sketch grammar (WSG) is a set of rules defining the grammatical relations (=columns/categories) in a Word Sketch. In other words, WSG tells Sketch Engine which words should be regarded as collocations of the search word and also what type of collocation they are. (more…)
  • Word sketch triple

    A word sketch triple is a data format used for representing one collocation identified by the word sketch. A word sketch triple consists of:
    • node as lempos
    • name of the grammatical relation as displayed in the header of the column in the word sketch interface
    • collocate as lempos.
    school-n modifiers of "%w" secondary-j
    (more…)
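
    The three parts of a triple can be separated with a short Python sketch. It assumes that the parts are separated by whitespace, that the relation name may itself contain spaces, and that a lempos is a lemma followed by a hyphen and a part-of-speech code (as in school-n). The names Lempos, Triple, parse_lempos and parse_triple are hypothetical helpers, not part of any Sketch Engine API.

```python
from typing import NamedTuple


class Lempos(NamedTuple):
    lemma: str
    pos: str


class Triple(NamedTuple):
    node: Lempos
    relation: str
    collocate: Lempos


def parse_lempos(text: str) -> Lempos:
    # Assumption: the POS code follows the last hyphen, e.g. "school-n".
    lemma, _, pos = text.rpartition("-")
    return Lempos(lemma, pos)


def parse_triple(line: str) -> Triple:
    # The first whitespace-separated field is the node, the last one is the
    # collocate; everything in between is the relation name.
    node, rest = line.strip().split(maxsplit=1)
    relation, collocate = rest.rsplit(maxsplit=1)
    return Triple(parse_lempos(node), relation, parse_lempos(collocate))


triple = parse_triple('school-n modifiers of "%w" secondary-j')
print(triple)
```

    Splitting from both ends keeps the relation name intact even though it contains spaces and quotation marks.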