• ARF – Average Reduced Frequency [ statistics ]

    a modified frequency which prevents the result from being excessively influenced by one part of the corpus (e.g. one or more documents) that contains a high concentration of the token. If the token is evenly distributed across the corpus, ARF and frequency per million will be comparable. see also ARF definition
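
    A minimal sketch of how ARF can be computed, assuming the commonly published definition: with f hits in a corpus of N tokens, v = N / f, and ARF = (1 / v) × the sum of min(d_i, v) over the distances d_i between consecutive hits, where the first distance wraps around the end of the corpus. The function name and the 0-based positions are illustrative assumptions, not Sketch Engine code.

```python
def average_reduced_frequency(positions, corpus_size):
    """ARF of a token, given the token positions of its hits (assumed 0-based)."""
    f = len(positions)
    if f == 0:
        return 0.0
    v = corpus_size / f  # average gap between hits if they were evenly spread
    pos = sorted(positions)
    # Distance from the last hit to the first one, wrapping around the corpus end
    distances = [pos[0] + corpus_size - pos[-1]]
    # Distances between consecutive hits
    distances += [b - a for a, b in zip(pos, pos[1:])]
    # Each gap contributes at most v, so tight clusters add very little
    return sum(min(d, v) for d in distances) / v


evenly_spread = average_reduced_frequency([0, 25, 50, 75], 100)
clustered = average_reduced_frequency([0, 1, 2, 3], 100)
```

    Under this definition, four hits spread evenly over 100 tokens give an ARF of 4, while four adjacent hits give an ARF of about 1.12.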
  • CAT tool

    A CAT tool is a computer-assisted translation tool: software that helps translators maintain consistency in terminology across their translation jobs and also aids the translation process by suggesting (or automatically supplying) passages which the translator has already translated in the past.
  • cluster

    the process of creating groups of words in the thesaurus or word sketch. Words are grouped according to their shared collocational behaviour. See more in the Clustering Neighbours documentation
  • collocate

    the part of a collocation that is not the node, e.g. the collocate strong and the node wind make up the collocation strong wind
  • collocation

    a collocation is a sequence of words or terms that co-occur more often than would be expected by chance (from Wikipedia: Collocation). A collocation, e.g. fatal error, typically consists of a node (error) and a collocate (fatal). Collocations differ in strength: nice house is a weak collocation because both nice and house can combine with lots of other words, whereas the Opera House is a strong collocation because it is very typical for opera to occur next to house and, at the same time, opera does not combine with many other words.
  • comparable corpus [ corpus-types ]

    A comparable corpus is a corpus consisting of texts from the same domain in two or more languages. In contrast to a parallel corpus, the texts are not translations of each other; they only belong to the same domain and share the same metadata. An example of a comparable corpus is a corpus made from Wikipedia.
  • compile

    Corpus compilation refers to processing the corpus data (text) with the tools available for the language and converting the text into a corpus. Only a compiled corpus can be searched. see corpus compilation
  • concordance [ feature ]

    a list of all examples of the search word or phrase found in a corpus, usually in the format of a KWIC concordance with the search word highlighted in the centre of the screen and some context to the right and to the left. read more»
  • concordancer [ feature ]

    A concordancer is a tool (a piece of software) which searches a text corpus and displays a concordance. A concordancer is one of the features in Sketch Engine which allows for simple corpus searches as well as queries involving complex criteria that search for grammatical or lexical structures. see also concordance
  • cooccurrence [ text-analysis ]

    cooccurrence or co-occurrence is a term which expresses how often two terms from a corpus occur alongside each other in a certain order. It usually indicates words which together create a new meaning. Such combinations are called phrasemes or multi-word expressions, e.g. black sheep or get on. Sketch Engine helps to find such words using the word sketch tool or the collocation search. Read more about further tools for text analysis.
  • corpus

    a large collection of texts used for studying language. A corpus is usually annotated (= words are labelled with information about the part of speech and grammatical category). The terms corpus, text corpus and language corpus are interchangeable. Using a corpus for any type of linguistic or language-oriented work ensures the outcomes reflect the real use of the language. more on corpora»
  • corpus architect

    an intuitive tool inside Sketch Engine for creating corpora from documents or the Web which does not require any expert knowledge. See the create your own corpus page.
  • corpus manager

    a program used to manage text corpora, i.e. to build, edit, annotate and search corpora. Sketch Engine is the user interface to the corpus manager Manatee.
  • CQL

    The Corpus Query Language is a code used to set criteria for complex searches which cannot be carried out using the standard user interface controls. The criteria may include not only words or lemmas but also tags, text types and other attributes. Logical operators (AND/OR/NOT) can be used. Learn CQL: https://www.sketchengine.eu/corpus-querying/
  • CSV

    a type of plain text document used for saving tabular data. It is seamlessly accepted by a large variety of applications and is therefore ideal for exporting Sketch Engine results to be used in other software. CSV can be opened directly in Microsoft Excel, Open Office, Google Documents and many others.
  • deduplication

    a process during which repeated identical texts are removed and only the first of the identical (duplicated) texts is kept. Deduplication can be carried out at various levels, e.g. documents, in which case one of two identical documents is removed in its entirety.
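
    The idea of keeping only the first of several identical texts can be sketched as follows. This is an illustration at document level only; the function name and the whitespace normalisation are assumptions and do not describe the deduplication tool used by Sketch Engine.

```python
def deduplicate(documents):
    """Keep the first occurrence of each document and drop later identical copies."""
    seen = set()
    kept = []
    for doc in documents:
        # Assumption: documents that differ only in whitespace count as identical
        key = " ".join(doc.split())
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept


unique_docs = deduplicate(["a b", "c", "a  b", "c"])
```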
  • disambiguation

    the process of identifying the correct reading of a word (lemma, part of speech) in context when the word has multiple possible meanings. The result of this process is that each word is assigned a single meaning.
  • distributional thesaurus [ feature ]

    an automatically produced thesaurus which identifies words that occur in similar contexts as the target word. It draws on the hypothesis of distributional semantics. The automatically produced thesaurus is available for each word in the corpus. more about automatic thesaurus. The distributional thesaurus in Sketch Engine is available for every language and corpus that supports word sketches. Refer to the user manual to learn how to generate the thesaurus.
  • freq/mill – frequency per million [ statistics ]

    the number of occurrences (hits) of an item normalised per million tokens, also called i.p.m. (instances per million). It is used to compare frequencies between corpora of different sizes.
    frequency per million = number of hits / corpus size in millions of tokens
    Example: A token found 10 times in a corpus of 1 million tokens has a frequency per million of 10. A token found 100 times in a corpus of 100 million tokens has a frequency per million of 1. The second token is therefore relatively less frequent. see also Statistics in Sketch Engine, Frequency per million, Average Reduced Frequency
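
    The normalisation above can be checked with a few lines of code. The function name is an illustrative assumption; the numbers are the two examples from the definition.

```python
def frequency_per_million(hits, corpus_size_tokens):
    """Normalise a raw hit count to occurrences per million tokens."""
    return hits * 1_000_000 / corpus_size_tokens


fpm_small_corpus = frequency_per_million(10, 1_000_000)      # 10 hits in 1 million tokens
fpm_large_corpus = frequency_per_million(100, 100_000_000)   # 100 hits in 100 million tokens
```

    Although the second token has more hits in absolute terms, its normalised frequency (1 per million) is lower than that of the first token (10 per million).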
  • GDEX

    Good Dictionary Examples are sentences which are suitable as dictionary example sentences, i.e. they are illustrative and representative. A concordance can be sorted with the best GDEX sentences at the top. Sketch Engine evaluates the sentences with respect to sentence length and complexity, safe topics, the presence of difficult and low-frequency words and other similar criteria specified in the GDEX configuration. more on GDEX
  • global subcorpus

    a subcorpus that is shared with all users. See the instructions on how to share a subcorpus with all users»
  • header field

    various types of information associated with the documents of a corpus, e.g. a corpus with documents from different domains can be structured by domain using the header field domain on the <doc> structure with the value "nameofdomain", written as <doc domain="nameofdomain">
  • KWIC

    KWIC is the acronym for Key Word in Context and refers to the red text highlighted in a concordance. The red text is the result that matches the search criteria. Such a concordance may be referred to as a KWIC concordance.
  • lc [ attribute ]

    lowercase word form, i.e. a case-insensitive word form: done is the same as Done. see word form
  • learner corpus [ corpus-types ]

    A collection of texts produced by learners of a language, used to study the errors and mistakes that learners make. Learner corpora in Sketch Engine can use both error and correction annotation. A special search interface is available to search by the former or the latter or both. see also Setting up a learner corpus
  • lemma [ attribute ]

    A lemma is the basic form of a word, typically the form found in dictionaries. Searching for a lemma will also include all forms of the word in the result, e.g. searching for the lemma go will find go, goes, went, going, gone. Lemma is case sensitive: go and Go are two different lemmas. see also lemma_lc or compare with word form
  • lemma_lc [ attribute ]

    lemma_lc is a case-insensitive lemma. All uppercase characters are converted to lowercase, so apple and Apple are treated as the same lemma. see lemma
  • Lemmatization

    Lemmatization is the process of assigning a lemma to each word form in a corpus using an automatic tool called a lemmatizer. Lemmatization brings the benefit of searching for the base form of a word and getting all the derived forms in the result, e.g. searching for go will also find goes, went, gone, going.
  • lempos [ attribute ]

    lempos is a combination of lemma and part of speech (pos) consisting of the lemma, a hyphen and a one-letter abbreviation of the part of speech, e.g. go-v, house-n. The part of speech abbreviations differ between corpora. Lempos is case sensitive: house-n is different from House-n. see also lempos_lc
  • lempos_lc [ attribute ]

    lempos_lc is a case insensitive counterpart of lempos. All uppercase letters are converted to lowercase, thus House-n becomes identical with house-n.
  • likelihood [ statistics ]

    a function of parameters of a statistical model, it plays a key role in statistical inference and is the basis for the log-likelihood function. see Statistics in Sketch Engine
  • log-likelihood [ statistics ]

    one of the functions used to compute statistics in Sketch Engine. It is an association measure based on the likelihood function and is used in tests for significance (see the log-likelihood calculator and more details)
  • logDice [ statistics ]

    a statistical measure for identifying collocation candidates which is used in the word sketch feature. It is based only on the frequencies of the words w_1 and w_2 and of the bigram w_1 w_2; it is not affected by the size of the corpus. See logDice in Statistics used in Sketch Engine.
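
    A minimal sketch of the measure, assuming the published logDice formula 14 + log2(2 × f(w_1 w_2) / (f(w_1) + f(w_2))). The function and parameter names are illustrative. Note that the corpus size does not appear anywhere in the calculation.

```python
import math


def log_dice(f_bigram, f_w1, f_w2):
    """logDice from the bigram frequency and the two individual word frequencies."""
    return 14 + math.log2(2 * f_bigram / (f_w1 + f_w2))


always_together = log_dice(10, 10, 10)  # the two words never occur apart
half_the_time = log_dice(1, 2, 2)
```

    Under this formula the theoretical maximum is 14, reached when the two words only ever occur together.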
  • Longest-commonest match

    The longest-commonest match is a concept coined by Adam Kilgarriff to name the most common realisation of a collocation, i.e. the chunk of language in which the collocation appears most frequently. The longest-commonest match is part of the word sketch result screen to facilitate the understanding of how the collocation typically behaves.
  • metadata

    information about the texts in the corpus: for example, year of publication, author name, publishing house, medium (written, spoken), register (formal, informal) etc. Metadata are automatically converted to text types in Sketch Engine. see Annotate a corpus  
  • MI Score [ statistics ]

    The Mutual Information score expresses the extent to which words co-occur compared to the number of times they appear separately. MI Score is strongly affected by frequency: low-frequency words tend to reach a high MI score, which may be misleading. This is why Sketch Engine allows setting a limit so that words with a frequency below this limit are not included in the calculation. In most cases T-score is more useful than MI score. see Concordance - Collocations, see Statistics in Sketch Engine, compare T-score
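
    The sensitivity to low frequencies can be illustrated with a short sketch. It assumes the usual formulation MI = log2(f_xy × N / (f_x × f_y)), where f_xy is the number of co-occurrences, f_x and f_y are the frequencies of the two words and N is the corpus size; the names are illustrative and the exact formula used by Sketch Engine is given in Statistics in Sketch Engine.

```python
import math


def mi_score(f_xy, f_x, f_y, corpus_size):
    """Mutual Information score of a word pair (assumed formulation)."""
    return math.log2(f_xy * corpus_size / (f_x * f_y))


frequent_pair = mi_score(8, 16, 32, 1024)
# Two words that each occur once, and only together, in a corpus of 1 million tokens
rare_pair = mi_score(1, 1, 1, 1_000_000)
```

    The rare pair scores close to 20 although it is supported by a single co-occurrence, which is the kind of misleading result a minimum frequency limit is meant to filter out.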
  • minimum sensitivity [ statistics ]

    a statistical measure similar to logDice which is the minimum of the two following numbers:

    • the number of co-occurrences divided by the frequency of the collocate
    • the number of co-occurrences divided by the frequency of the node word

    The minimum sensitivity number grows with a high number of co-occurrences and falls with a high number of occurrences of the individual words (node word or collocate).
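
    The two ratios listed above translate directly into code. The function and parameter names are illustrative assumptions.

```python
def minimum_sensitivity(cooccurrences, f_node, f_collocate):
    """Minimum of the two co-occurrence ratios described above."""
    return min(cooccurrences / f_collocate, cooccurrences / f_node)


# 10 co-occurrences, node word seen 100 times, collocate seen 20 times
score = minimum_sensitivity(10, f_node=100, f_collocate=20)
```

    Here the ratios are 10 / 20 = 0.5 and 10 / 100 = 0.1, so the result is 0.1: the more frequent of the two words determines the score.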

  • multilevel list

    a list sorted at more than one level, e.g. a frequency list sorted by word form, then by lemma and then by tag. See this multilevel list in the BAWE corpus.
  • n-gram

    a sequence of n items (bigram = 2 items, trigram = 3 items ... n-gram = n items), typically letters or words but also phonemes or syllables. Generating a frequency list of such sequences can help us notice which items tend to combine in a language. n-grams are generated using the word list feature.
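
    A small sketch of how an n-gram frequency list can be produced from a sequence of tokens. It only illustrates the concept; the function name and the sample sentence are assumptions and this is not how the word list feature is implemented.

```python
from collections import Counter


def ngrams(items, n):
    """Return all contiguous sequences of n items."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


tokens = "to be or not to be".split()
bigram_counts = Counter(ngrams(tokens, 2))  # frequency list of word bigrams
letter_trigrams = ngrams(list("corpus"), 3)  # the same function works on letters
```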
  • node

    (collocation) the central word in a collocation, e.g. strong wind consists of the collocate strong and the node wind
    (concordance) the search word or phrase, sometimes called a query, which appears in the centre of a KWIC concordance or is highlighted in other types of concordances
  • non-word

    generally speaking, non-words are tokens which do not start with a letter of the alphabet. Examples of non-words: !mportant, 2U (There might be rare cases when the corpus author uses a different definition in their corpus. Such a definition is part of the corpus configuration file.)
  • overall score [ statistics ]

    score of the relation based on logDice in word sketches. The score is displayed in the header of each column of the relation.
  • parallel corpus [ corpus-types ]

    A parallel corpus is a corpus consisting of the same text in two languages. The texts are aligned (matching segments, usually sentences, are linked). The corpus allows searches in one or both languages to look up translations.
  • PoS

    part of speech, some typical examples of parts of speech are: noun, adjective, verb, adverb etc.
  • POS tag [ attribute ]

    POS tag stands for part-of-speech tag - a label with information about part of speech and grammatical categories assigned to each token in a corpus. It is often shortened to tag.
  • POS tagger

    POS (part of speech) tagging is a process of annotating each token with a tag carrying information about the part of speech and often also morphological and grammatical information such as number, gender, case, tense etc. The automatic tagging tool is called a tagger or POS tagger.
  • positional attribute

    information added to each token in a corpus, e.g. its lemma (basic form of a word) or part of speech. Attributes differ between corpora and even between corpora in the same language. Attributes are listed on the corpus statistics and details page. For example:
    word    lemma   tag   lempos
    dogs    dog     n     dog-n
  • preloaded corpus [ corpus-types ]

    a ready-to-use corpus included in Sketch Engine subscription or Trial access, not created by a user, e.g. British National Corpus
  • query

    a sequence of characters or words or their combinations entered by the user in order to retrieve a concordance. Often, the word query is not restricted to the concordance but can also refer to any type of search or criteria used in connection with any Sketch Engine feature, e.g. Word Sketch, thesaurus, word list etc.
  • reference

    an attribute which describes a document, e.g. the URL of the document. References provide information about each document in a corpus.
  • reference corpus

    a corpus chosen as a standard of comparison with your corpus. The reference corpus is used when extracting terms (keywords).
  • regular expressions

    a collection of special symbols that can be used to search for patterns rather than specific characters, e.g. to find all words starting with, containing or ending in a specific sequence of characters. For example, .*tion will find all words ending in tion preceded by any number of characters. read more»
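
    The .*tion pattern from the definition can be tried out with Python's standard re module. The word list is made up for illustration, and fullmatch is used on the assumption that the pattern must cover the whole word.

```python
import re

pattern = re.compile(r".*tion")
words = ["nation", "station", "national", "action", "tion"]

# fullmatch requires the whole word to fit the pattern, so the word must end in "tion"
ending_in_tion = [w for w in words if pattern.fullmatch(w)]
```

    national is not matched because it does not end in tion, while tion itself is matched because .* also accepts zero characters.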
  • relative text type frequency

    compares the frequency in a specific text type (part of the corpus) to the whole corpus, or compares frequencies in different text types (parts of the corpus) even if they are not the same size. Thus the user can see whether the search word(s) is typical only of a specific text type (e.g. newspapers) and not of the rest of the corpus. The number is the relative frequency of the query result divided by the relative size of the particular text type. It can be interpreted as "how much more or less often does the result of the query occur in this text type in comparison to the whole corpus". A higher frequency means a higher value; a bigger text type means a lower value. Example: the word 'test' has 2000 hits in the corpus. 400 of them are in the text type "Spoken" and this text type represents 10 % of the corpus. The relative text type frequency is then (400 / 2000) / 0.1 = 2, i.e. 200 %, which means 'test' is twice as common in "Spoken" as in the whole corpus. see also Statistics in Sketch Engine
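
    The worked example above can be reproduced with a short calculation. The function and parameter names are illustrative assumptions.

```python
def relative_text_type_frequency(hits_in_text_type, hits_total, text_type_share):
    """Share of the hits found in the text type divided by the text type's share of the corpus."""
    return (hits_in_text_type / hits_total) / text_type_share


# 400 of 2000 hits are in "Spoken", which makes up 10 % of the corpus
ratio = relative_text_type_frequency(400, 2000, 0.10)
percentage = ratio * 100
```

    A ratio of 2 corresponds to 200 %; a ratio of 1 (100 %) would mean the word is exactly as common in the text type as in the whole corpus.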
  • salience [ statistics ]

    a statistical measure of the significance of a specific token in the given context. It is measured with logDice. For more information, see section 3 of Statistics used in Sketch Engine.
  • search attribute

    the attribute that is used for the search and creating a word list. You can have the word list of words, lemmas, tags, etc.
  • search span

    the number of tokens on either side of the node that are examined when filtering a concordance. A search span from -5 to 5 means that the filter selects all concordance lines in which the filter criterion is met within 5 tokens to the left or right of the node.
  • simple math [ statistics ]

    the simple formula used for the computation and identification of terms and keywords. see Simple math.
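
    A sketch of the calculation, assuming the formulation published for keyword extraction: (fpm_focus + n) / (fpm_ref + n), where fpm_focus and fpm_ref are the frequencies per million in the focus and reference corpus and n is a smoothing parameter. The names and the default n = 1 are assumptions made for this illustration.

```python
def simple_math_score(fpm_focus, fpm_ref, n=1.0):
    """Keyness score comparing normalised frequencies in a focus and a reference corpus."""
    return (fpm_focus + n) / (fpm_ref + n)


keyness = simple_math_score(10.0, 1.0)        # (10 + 1) / (1 + 1)
unseen_in_ref = simple_math_score(10.0, 0.0)  # smoothing avoids division by zero
```

    Adding n to both frequencies keeps the score defined for items that do not occur in the reference corpus at all.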
  • stemming

    stemming is the process during which the affixes (suffixes, prefixes, etc.) of a word are removed so that only the stem remains. Stemming is used to detect related words with the same stem, i.e. the word root which does not change with case, number or tense. Word stems are available in the Portuguese corpus ptTenTen. The tools which carry out this analysis are called stemmers.
  • structure

    a corpus structure refers to the segments or parts into which a corpus can be divided. Typically, a corpus is divided into sentences, paragraphs and documents but corpora can use various other structures depending on the type of corpus. see a list of common corpus structures, see Dividing a corpus into smaller parts and annotating them
  • subcorpus

    a corpus can be subdivided into an unlimited number of parts called subcorpora. Subcorpora can be used to divide the corpus by the type (fiction, newspaper), media (spoken, written) or time (e.g. by years) or by any other criteria. A subcorpus can also be created from a concordance by including all concordance lines and the documents they come from into a subcorpus. How to create a subcorpus»
  • T-score [ statistics ]

    T-score expresses the certainty with which we can argue that there is an association between the words, i.e. that their co-occurrence is not random. The value is affected by the frequency of the whole collocation, which is why very frequent word combinations tend to reach a high T-score despite not being significant as collocations. In most cases, T-score is more reliable or more useful than MI Score. see Concordance - collocations, see Statistics in Sketch Engine, compare MI Score
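
    A minimal sketch, assuming the usual formulation T-score = (f_xy - f_x × f_y / N) / sqrt(f_xy), i.e. the observed number of co-occurrences minus the number expected by chance, divided by the square root of the observed number. The names are illustrative; the formula used by Sketch Engine is given in Statistics in Sketch Engine.

```python
import math


def t_score(f_xy, f_x, f_y, corpus_size):
    """T-score of a word pair (assumed formulation)."""
    expected = f_x * f_y / corpus_size  # co-occurrences expected by chance
    return (f_xy - expected) / math.sqrt(f_xy)


score = t_score(16, 32, 64, 2048)
```

    In this example one co-occurrence is expected by chance and 16 are observed, giving (16 - 1) / 4 = 3.75.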
  • tag [ attribute ]

    (also called morphological tag or POS tag) a label assigned to each token in an annotated corpus to indicate the part of speech and grammatical category. The tool used to annotate a corpus is called a tagger. A collection of tags used in a corpus is called a tagset. See our blog about POS tags.
  • tagset

    (also called tag set) a list of part-of-speech tags used in one corpus. In Sketch Engine, corpora in the same language tend to use the same tagset but exceptions exist. To check the tagset used, access Corpus statistics and details. See our blog about POS tags.
  • TBL

    Tick Box Lexicography, an application in Sketch Engine for collecting usage-example sentences to build dictionaries. Find more on the Tick Box Lexicography page
  • term

    a keyword or multi-word expression that is more frequent in one corpus than in another and, at the same time, is not a common word or phrase such as "the", "house" or "at the". Such a term is therefore significant for the corpus. See more on term extraction»
  • term base

    In connection with CAT tools, a term base is a database of subject-specific terminology and other lexical items which need to be translated consistently. The CAT tool uses the term base to check the consistency of translation, to look for untranslated segments, and to suggest (or automatically supply) translations of the terms from the database.
  • term extraction

    the process of identifying subject specific vocabulary in a subject specific text usually using specialized software. The finding of one-word and multi-word terms in Sketch Engine is based on a comparison with the frequency of these words and phrases in a reference corpus.
  • text analysis [ text-analysis ]

    text analysis (also content analysis) is a method for analyzing texts in order to gain information from them. The result of the content analysis is structured data which can be used for further processing. Sketch Engine offers a one-page automatic summary of a word's collocations with the word sketch feature. See also other text analysis tools.
  • text mining [ text-analysis ]

    text mining is an automatic process of extracting information from text, such as keywords of a text or its source(s). The corresponding tools in Sketch Engine are WebBootCaT for creating corpora from the web or keywords and terms extraction which finds terminology in your texts. Read about other text analysis tools.
  • text type

    a text type is a term used when talking about text corpora which refers to values assigned to structures (e.g. documents, paragraphs, sentences and others) inside a corpus. Text types are sometimes called metadata or headers. Text types can refer to the source (newspaper, book etc.), medium (spoken, written), time (year, century) or any other type of information about text. Not all corpora have documents annotated for text types. Corpora can be divided into subcorpora based on text types and searches and other analysis can be performed only on texts belonging to the selected text type.
  • token

    A token is the smallest unit into which a corpus is divided. Typically each word form and each punctuation mark (comma, dot, ...) is a separate token (but don't in English consists of 2 tokens). Therefore, corpora contain more tokens than words. Spaces between words are not tokens. A text is divided into tokens by a tool called a tokenizer which is often specific for each language.
  • tokenization

    Tokenization is the automatic process of separating text into tokens.
  • tokenizer

    A tokenizer is a tool (software) used for dividing text into tokens. A tokenizer is language specific and takes into account the peculiarities of the language, e.g. don't in English is tokenized as two tokens. Sketch Engine contains tokenizers for many languages and also a universal tokenizer used for languages not yet supported by Sketch Engine. The universal tokenizer only recognizes whitespace characters as token boundaries, ignoring any language-specific rules. This, however, is sufficient for the use of many Sketch Engine features.
  • translation memory

    A translation memory is a database inside a CAT tool which holds segments of text translated in the past. The CAT tool can suggest (or automatically supply) translations based on matching text from the translation memory.
  • trends

    Trends is a feature used for diachronic analysis, i.e. for identifying how the frequency of the word (or other attributes) changes over time. read more
  • UMS

    a feature available to users with a local installation, used for the administration of users and corpora.
  • user corpus [ corpus-types ]

    a corpus created by a user. Users can create corpora by uploading their own data or using Sketch Engine to collect data from the Web. User corpora can be shared with other users.
  • vertical file

    A vertical file is a text file where each token (or word) is on a separate line. This format is typically used for text corpora and may contain additional metainformation (annotation). The first column contains tokens and structures, the other columns may contain part of speech, lemmas or other positional attributes. An example of a vertical file:
    Text		NN	text-n
    corpora		NN	corpus-n
    are		VBP	be-v
    comprised	VVN	comprise-v
    of		IN	of-i
    column 1: tokens and structures
    column 2: part of speech tags
    column 3: lempos attribute
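
    A three-column vertical file like the one above could be read as sketched below. The function name, the sample text and the <s> structure tags are assumptions made for illustration. Columns are split on any whitespace for simplicity, and every line starting with < is treated as a structure, which would wrongly skip a token that is itself the character <.

```python
def parse_vertical(text):
    """Read word, tag and lempos columns from vertical text, skipping structures."""
    tokens = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and structure lines such as <doc>, <s> or </s>
        if not line or line.startswith("<"):
            continue
        word, tag, lempos = line.split()[:3]
        tokens.append({"word": word, "tag": tag, "lempos": lempos})
    return tokens


sample = "<s>\nText\tNN\ttext-n\ncorpora\tNN\tcorpus-n\nare\tVBP\tbe-v\n</s>\n"
parsed = parse_vertical(sample)
```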
  • web mining [ text-analysis ]

    web mining is an application of data mining which extracts information from texts. Web mining focuses on gaining information and metadata from the web. For this task, Sketch Engine uses the fully automated tool WebBootCaT for creating corpora from the web, which also stores metadata of the processed websites. Read about other text analysis tools.
  • word form [ attribute ]

    A word form refers to one form that a word can take, e.g. the word go can take these word forms: go, went, gone, goes, going. Searching for the word form going will not find any other forms of the word. It is case sensitive: apple and Apple are two different word forms.
  • word list

    A word list is a generic name for various types of lists such as list of words, lemmas, POS tags or other attributes with their frequency (hit counts, document counts or others).
  • word sketch

    A word sketch is a one-page, automatic, corpus-derived summary of a word’s grammatical and collocational behaviour. more»
  • Word Sketch grammar

    Word Sketch grammar (WSG) is a set of rules defining the grammatical relations (= columns/categories) in a Word Sketch. WSG is language dependent: the same WSG cannot be shared across languages. Different corpora in the same language can use the same or different WSG. Users can write their own WSG to match their specific needs. Corpora in unsupported languages can make use of a universal WSG which provides only basic statistics of words surrounding the keywords, ignoring the grammar of the language. The universal WSG can also be modified by the user. more»