reference corpus
A reference corpus is a corpus to which the focus corpus is compared. It is used in keyword extraction and term extraction, and it can also be used with n-grams. In the Keywords & Terms tool, a reference corpus is preselected, but the user can choose a different corpus as the reference corpus. The reference corpus can, but does not have to, be the same for keywords and for terms. With n-grams, the reference corpus option identifies n-grams that are typical of the focus corpus in comparison with the reference corpus. See also: term, term extraction
regular expressions
A collection of special symbols that can be used to search for patterns rather than specific characters, e.g. to find all words starting with, containing or ending in a specific sequence of characters. For example, .*tion will find all words ending in tion, preceded by any number of characters.
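The .*tion pattern can be tried outside the corpus interface. The following is a minimal sketch using Python's re module, which is an assumption made for illustration and not the search engine behind the corpus tool; the word list is invented. The whole word must match the pattern, so fullmatch is used instead of search.

```python
import re

# Hypothetical word list, invented for this illustration.
words = ["nation", "station", "tionless", "action", "actions", "tion"]

# ".*" stands for any number of characters (including none),
# followed by the literal sequence "tion".
pattern = re.compile(r".*tion")

# fullmatch succeeds only if the entire word fits the pattern,
# so words that merely contain "tion" are excluded.
matches = [w for w in words if pattern.fullmatch(w)]

print(matches)  # ['nation', 'station', 'action', 'tion']
```

Because .* also matches zero characters, the bare word tion is matched as well.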
relative frequency, frequency per million [statistics]
(also called freq/mill in the interface) The number of occurrences (hits) of an item per million tokens, also called i.p.m. (instances per million). It is used to compare frequencies between corpora of different sizes.

frequency per million = number of hits / corpus size in millions of tokens

The frequency per million always relates to the whole corpus or subcorpus, not to a text type. Restricting the query to one or more text types will affect the number of hits, but the frequency per million will still be calculated using the number of tokens in the whole (sub)corpus. To relate the frequency per million to one or more text types, create a subcorpus from the text type(s) and restrict the query to this subcorpus.
Example
Looking up the frequency of the word "helps" in the British National Corpus (112,181,015 tokens) without any restriction, then in the spoken text type, and then in the spoken subcorpus will produce these results.
| Subcorpus selected | Text type selected | Hits | Frequency per million | Possible interpretation |
|---|---|---|---|---|
| none | none | 3,116 | 27.75, in relation to the number of tokens in the whole corpus | helps appears 27.75 times per million tokens in BNC |
| none | spoken | 302 | 2.69, in relation to the number of tokens in the whole corpus | ‘spoken’ helps appears 2.69 times per million tokens in BNC |
| spoken (11,787,138 tokens) | none | 302 | 25.62, in relation to the number of tokens in the subcorpus | helps appears 25.62 times per million tokens in the spoken part of BNC |
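The two results with 302 hits differ only in the divisor. The sketch below reproduces them from the figures in the example; the function and variable names are hypothetical and chosen only for this illustration.

```python
def frequency_per_million(hits: int, size_in_tokens: int) -> float:
    """Hits divided by the (sub)corpus size expressed in millions of tokens."""
    return hits / (size_in_tokens / 1_000_000)


# Figures taken from the example above.
BNC_TOKENS = 112_181_015
SPOKEN_SUBCORPUS_TOKENS = 11_787_138
HITS_SPOKEN = 302

# Query restricted to the spoken text type: the divisor stays the whole corpus.
text_type_fpm = frequency_per_million(HITS_SPOKEN, BNC_TOKENS)

# Query restricted to the spoken subcorpus: the divisor is the subcorpus.
subcorpus_fpm = frequency_per_million(HITS_SPOKEN, SPOKEN_SUBCORPUS_TOKENS)

print(f"{text_type_fpm:.2f}")  # 2.69
print(f"{subcorpus_fpm:.2f}")  # 25.62
```

The number of hits is identical in both cases; only the token count used as the divisor changes, which is why a subcorpus is needed to relate the figure to the spoken texts alone.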
relative text type frequency
(also called Relative density in the interface) Relative text type frequency compares the frequency in a specific text type to the frequency in the whole corpus. It shows how typical the word(s) is of a specific text type, e.g. of the spoken part of the corpus or of a particular website from which the texts were downloaded. The number is the relative frequency of the query result divided by the relative size of the particular text type. It can be interpreted as how much more or less frequent the query result is in this text type compared to the whole corpus.
- less than 100 % – it is less frequent in this text type than in the whole corpus, it is not typical or specific of this text type
- 100 % – it is as frequent in this text type as it is in the whole corpus
- more than 100 % – it is more frequent in this text type than in the whole corpus, it is typical or specific of this text type
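The calculation can be sketched as follows. This is one reading of the definition above, not a confirmed description of how the interface computes the value: "relative frequency of the query result" is taken to mean the share of all hits that fall in the text type, and "relative size" the share of all tokens that belong to it. The function name is hypothetical; the figures are those of the British National Corpus example for the word helps.

```python
def relative_text_type_frequency(
    hits_in_text_type: int,
    hits_in_corpus: int,
    text_type_tokens: int,
    corpus_tokens: int,
) -> float:
    """Share of hits in the text type divided by its share of tokens, in percent."""
    share_of_hits = hits_in_text_type / hits_in_corpus
    share_of_tokens = text_type_tokens / corpus_tokens
    return 100 * share_of_hits / share_of_tokens


# Assumed inputs: 302 of 3,116 hits are in the spoken text type,
# which holds 11,787,138 of the 112,181,015 tokens in the corpus.
rtf = relative_text_type_frequency(302, 3_116, 11_787_138, 112_181_015)

print(f"{rtf:.1f} %")
```

With these inputs the result is a little over 92 %, which falls under the first case in the list: under this reading, helps is slightly less frequent in the spoken texts than in the corpus as a whole.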