The format of a term differs from language to language. In most situations, a term is required to be a noun phrase. For example, a term in English can be composed of nouns (N), adjectives (J) and prepositions, so the phrase should match one of these patterns: N+N, N of N, J+N, J+J+N, J+N of N, J+N of J+N, etc. A sequence such as preposition + article + adjective, by contrast, is unlikely to be considered a term.
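As a rough illustration of the pattern check described above, the following Python sketch tests whether a phrase that has already been tagged matches one of the listed English patterns. The one-letter tags, the ALLOWED_PATTERNS set and the function names are assumptions made for this example only; they are not the tag set or the code used by any particular tool.

```python
# Hypothetical one-letter tags: N = noun, J = adjective,
# P = preposition, D = article.
ALLOWED_PATTERNS = {
    ("N", "N"),
    ("N", "of", "N"),
    ("J", "N"),
    ("J", "J", "N"),
    ("J", "N", "of", "N"),
    ("J", "N", "of", "J", "N"),
}


def pattern_of(tagged_phrase):
    """Turn [(word, tag), ...] into a pattern, keeping the literal word 'of'."""
    return tuple(
        "of" if word.lower() == "of" else tag for word, tag in tagged_phrase
    )


def is_term_candidate(tagged_phrase):
    """True if the phrase has one of the allowed term structures."""
    return pattern_of(tagged_phrase) in ALLOWED_PATTERNS


income_tax = [("income", "N"), ("tax", "N")]
cost_of_living = [("cost", "N"), ("of", "P"), ("living", "N")]
in_the_best = [("in", "P"), ("the", "D"), ("best", "J")]

print(is_term_candidate(income_tax))      # N+N
print(is_term_candidate(cost_of_living))  # N of N
print(is_term_candidate(in_the_best))     # preposition + article + adjective
```

Listing the patterns explicitly keeps the sketch close to the wording of the text; a real system would more likely describe the allowed structures with a grammar or regular expressions over tags.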
If we analyse texts from tabloid newspapers and texts from books on accounting, we are likely to find the phrases income tax and best way in both. Both phrases match the structure of a term in English (N+N and J+N respectively), but their frequencies are likely to differ. While the frequency of best way is likely to be similar in both sets of texts, the frequency of income tax is likely to be much higher in the texts on accounting. This difference is how the system can automatically tell a merely frequent phrase from a term, and why it will identify income tax as a term.
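The frequency comparison can be sketched in a few lines of Python. The counts and corpus sizes below are invented purely for illustration, and the score (a smoothed ratio of per-million frequencies) is one simple way to compare two corpora; the text above does not state which formula the system actually uses.

```python
def per_million(count, corpus_size):
    """Relative frequency, normalised to occurrences per million tokens."""
    return count * 1_000_000 / corpus_size


def frequency_ratio(focus_count, focus_size, ref_count, ref_size, smoothing=1.0):
    """Smoothed ratio of relative frequencies (focus texts vs. reference texts).

    The smoothing constant avoids division by zero when a phrase
    does not occur in the reference texts at all.
    """
    focus_fpm = per_million(focus_count, focus_size)
    ref_fpm = per_million(ref_count, ref_size)
    return (focus_fpm + smoothing) / (ref_fpm + smoothing)


# Made-up numbers: (count in accounting books, count in tabloid texts).
ACCOUNTING_SIZE = 1_000_000
TABLOID_SIZE = 1_000_000
counts = {
    "income tax": (400, 10),
    "best way": (50, 50),
}

scores = {
    phrase: frequency_ratio(focus, ACCOUNTING_SIZE, ref, TABLOID_SIZE)
    for phrase, (focus, ref) in counts.items()
}

for phrase, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{phrase}: {score:.2f}")
```

With these invented counts, income tax scores far above 1 while best way scores about 1, which is the distinction between a term and a merely frequent phrase.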
Linguistic tools for term extraction
To achieve the best possible quality, the focus text must first be tagged for parts of speech. This ensures that each phrase in the text can be matched against the allowed term structures. Sequences of words containing undesirable parts of speech can then be excluded easily, producing a clean list of relevant terms.
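To show why tagging has to come first, here is a minimal Python sketch that scans a sentence which is assumed to be tagged already and discards every word sequence containing an undesirable part of speech. The tags, the UNDESIRABLE_TAGS set and the hand-tagged sentence are hypothetical; in practice the tags would come from a part-of-speech tagger.

```python
# Hypothetical coarse tags: N = noun, J = adjective, V = verb,
# D = article, P = preposition.
UNDESIRABLE_TAGS = {"V", "D", "P"}


def candidate_phrases(tagged_sentence, min_len=2, max_len=3):
    """Return word sequences with no undesirable tag that end in a noun."""
    candidates = []
    for size in range(min_len, max_len + 1):
        for start in range(len(tagged_sentence) - size + 1):
            window = tagged_sentence[start:start + size]
            tags = [tag for _, tag in window]
            if UNDESIRABLE_TAGS.isdisjoint(tags) and tags[-1] == "N":
                candidates.append(" ".join(word for word, _ in window))
    return candidates


# Hand-tagged example sentence (tags assigned manually for this sketch).
sentence = [
    ("the", "D"),
    ("new", "J"),
    ("income", "N"),
    ("tax", "N"),
    ("applies", "V"),
]

print(candidate_phrases(sentence))
```

This coarse filter still lets through sequences such as new income, which are grammatical candidates but not necessarily terms; the part-of-speech check only narrows the list down.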
Lemmatization (morphological analysis)
Another important prerequisite is lemmatization. This ensures that frequencies are calculated correctly even if the phrase is used in a different form, e.g. occurrences of income tax and income taxes should be counted together. This is especially vital for languages such as Spanish, Russian, German and many others, where verbs, nouns and other parts of speech can take many different endings and word forms.
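The effect of lemmatization on the counts can be illustrated with a small Python sketch. The LEMMAS lookup table is a toy stand-in invented for this example; a real system would rely on a proper lemmatizer or morphological analyser for the language in question.

```python
from collections import Counter

# Toy lemma lookup (an assumption for this sketch, not a real lemmatizer).
LEMMAS = {
    "taxes": "tax",
    "incomes": "income",
    "ways": "way",
}


def lemmatize_phrase(phrase):
    """Lower-case the phrase and replace each word form by its lemma."""
    return " ".join(LEMMAS.get(word, word) for word in phrase.lower().split())


occurrences = ["income tax", "income taxes", "Income tax", "best way"]

surface_counts = Counter(occurrences)
lemma_counts = Counter(lemmatize_phrase(phrase) for phrase in occurrences)

print(surface_counts)  # each surface form counted separately
print(lemma_counts)    # word forms merged under one lemmatized phrase
```

Without the lemmatization step, the three occurrences of income tax would be split across three different surface forms and the phrase would look rarer than it really is.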
Together, these tools and technologies make up the term extraction functionality of Sketch Engine, which is presented to the user in the easy-to-use term extraction interface of OneClick Terms.