• reference corpus

    a corpus to which the focus corpus is compared. A reference corpus is used in keyword extraction and term extraction, and it can also be used with n-grams. When using the Keywords & Terms tool, a reference corpus is preselected, but the user can select a different corpus as the reference corpus. The reference corpus can, but does not have to, be the same for keywords and for terms. With n-grams, the reference corpus option identifies n-grams typical of the focus corpus in comparison with the reference corpus. see also: term, term extraction
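The comparison between a focus corpus and a reference corpus can be sketched as a ratio of per-million frequencies. This is only an illustration: the smoothing constant, the function names and all counts below are assumptions, and the entry itself does not state which scoring formula the tool applies.

```python
def per_million(hits: int, corpus_tokens: int) -> float:
    """Normalise a raw hit count to occurrences per million tokens."""
    return hits / (corpus_tokens / 1_000_000)


def typicality_score(focus_hits: int, focus_tokens: int,
                     ref_hits: int, ref_tokens: int,
                     smoothing: float = 1.0) -> float:
    """Ratio of normalised frequencies, focus over reference.

    The smoothing constant (an assumption) avoids division by zero when
    the item is absent from the reference corpus.
    """
    focus_fpm = per_million(focus_hits, focus_tokens)
    ref_fpm = per_million(ref_hits, ref_tokens)
    return (focus_fpm + smoothing) / (ref_fpm + smoothing)


# Hypothetical counts: 50 hits in a 1M-token focus corpus,
# 100 hits in a 100M-token reference corpus.
score = typicality_score(50, 1_000_000, 100, 100_000_000)
print(score)  # 25.5
```

A score well above 1 suggests the item is more typical of the focus corpus than of the reference corpus.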
  • regular expressions

    a collection of special symbols that can be used to search for patterns rather than specific characters, e.g. to find all words starting with, containing or ending in a specific sequence of characters. For example, .*tion will find all words ending in tion, preceded by any number of characters (including none).
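A minimal sketch of the .*tion pattern, using Python's `re` module. The word list is made up, and `fullmatch` is used on the assumption that the pattern must cover the whole word, as the description above implies.

```python
import re

# .* matches any run of characters (possibly empty); "tion" must end the word.
pattern = re.compile(r".*tion")

# Hypothetical word list for illustration.
words = ["nation", "station", "tion", "national", "motion", "cat"]

# fullmatch requires the pattern to cover the entire word.
matches = [w for w in words if pattern.fullmatch(w)]
print(matches)  # ['nation', 'station', 'tion', 'motion']
```

Note that "national" is not matched, because tion is not at the end of the word.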
  • relative frequency, frequency per million [ statistics ]

    (also called freq/mill in the interface) the number of occurrences (hits) of an item per million tokens, also called i.p.m. (instances per million). It is used to compare frequencies between corpora of different sizes.
    frequency per million = number of hits / corpus size in millions of tokens
    The frequency per million always relates to the whole corpus or subcorpus, not to a text type. Restricting the query to one or more text types affects the number of hits, but the frequency per million is still calculated using the number of tokens in the whole (sub)corpus. To relate the frequency per million to one or more text types, create a subcorpus from the text type(s) and restrict the query to this subcorpus.
    Example
    Looking up the frequency of the word 'helps' in the British National Corpus (112,181,015 tokens), first in the whole corpus, then restricted to the spoken text type, and finally in the spoken subcorpus, produces these results:
    | SUBCORPUS SELECTED | none | none | spoken (11,787,138 tokens) |
    | --- | --- | --- | --- |
    | TEXT TYPE SELECTED | none | spoken | none |
    | HITS | 3,116 | 302 | 302 |
    | FREQUENCY PER MILLION | 27.75, in relation to the number of tokens in the whole corpus | 2.69, in relation to the number of tokens in the whole corpus | 25.62, in relation to the subcorpus size |
    | POSSIBLE INTERPRETATION | helps appears 27.75 times per million words in BNC | ‘spoken’ helps appears 2.69 times per million in BNC | helps appears 25.62 times per million in the spoken part of BNC |
    see also: Statistics in Sketch Engine, Average Reduced Frequency
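The formula in this entry can be sketched in a few lines of Python. The function and variable names are illustrative assumptions; the token and hit counts are the ones from the British National Corpus example for the two queries that return 302 hits.

```python
def frequency_per_million(hits: int, tokens: int) -> float:
    """Number of hits divided by the corpus size in millions of tokens."""
    return hits / (tokens / 1_000_000)


BNC_TOKENS = 112_181_015
SPOKEN_SUBCORPUS_TOKENS = 11_787_138

# Text type restriction: hits drop to 302, but the denominator is
# still the whole corpus.
text_type_fpm = frequency_per_million(302, BNC_TOKENS)

# Subcorpus: the same 302 hits, now related to the subcorpus size.
subcorpus_fpm = frequency_per_million(302, SPOKEN_SUBCORPUS_TOKENS)

print(round(text_type_fpm, 2))  # 2.69
print(round(subcorpus_fpm, 2))  # 25.62
```

The same hit count yields very different values depending on which token count is used as the denominator, which is why a subcorpus is needed to relate the figure to a text type.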
  • relative text type frequency

    compares the frequency in a specific text type (part of the corpus) with the frequency in the whole corpus, or compares frequencies across different text types even if they differ in size. The user can thus see whether the search word(s) is typical of a specific text type (e.g. newspapers) rather than of the rest of the corpus. The number is the relative frequency of the query result divided by the relative size of the text type. It can be interpreted as “how much more or less frequent is the query result in this text type than in the whole corpus”. More hits in the text type mean a higher value; a larger text type means a lower value. E.g. the word 'test' has 2000 hits in the corpus. 400 of them are in the text type “Spoken”, and this text type represents 10 % of the corpus. The relative text type frequency is then (400 / 2000) / 0.1 = 200 %, which means 'test' is twice as common in “Spoken” as in the whole corpus. see also: Statistics in Sketch Engine
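The worked example for 'test' can be reproduced with a short Python sketch. The function and parameter names are assumptions chosen for illustration; the numbers come from the example in this entry.

```python
def relative_text_type_frequency(hits_in_text_type: int,
                                 hits_in_corpus: int,
                                 text_type_share: float) -> float:
    """Share of hits found in the text type, divided by the share of
    the corpus that the text type represents."""
    return (hits_in_text_type / hits_in_corpus) / text_type_share


# 400 of 2000 hits fall in "Spoken", which makes up 10 % of the corpus.
value = relative_text_type_frequency(400, 2000, 0.10)
print(f"{value:.0%}")  # 200%
```

A value of 100 % would mean the item is exactly as frequent in the text type as in the whole corpus; values below 100 % mean it is less frequent there.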