• gender lemma [ attribute ]

    The gender lemma is an attribute used in connection with term extraction. Its purpose is to display terminology in the correct word form in languages in which adjectives agree in gender with nouns. Using the plain lemma would produce a grammatically unacceptable combination of word forms, because the adjective lemma does not agree with the noun (in the examples below, the adjective lemma is masculine while the noun is feminine).

    Examples:

    language   word form            lemma              gender lemma
    Spanish    cámaras compactas    cámara compacto    cámara compacta
    Russian    Красной площади      красный площадь    Красная площадь
    Polish     piłce nożnej         piłka nożny        piłka nożna
  • lc [ attribute ]

    (also referred to as word_lc, word lowercase or word form lowercase) is a positional attribute assigned to each token in the corpus. The lc attribute is a lowercased version of the word attribute: John becomes john, Apple becomes apple, BE becomes be. The lc attribute makes the uppercase and lowercase versions of each token identical. It is used for case-insensitive searching and analysis. See also: word form, lemma (lowercase), list of attributes.
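    The relationship between word and lc can be illustrated with a minimal Python sketch. The token records, the helper names add_lc and search_lc, and the use of Python's str.lower() are assumptions made for illustration; they are not how Sketch Engine itself computes or stores the attribute.

```python
# Hypothetical word forms; in a real corpus these come from the tokenizer.
words = ["John", "Apple", "BE", "apple"]

def add_lc(word):
    """Return a token record with a derived lc attribute (assumed: str.lower())."""
    return {"word": word, "lc": word.lower()}

records = [add_lc(w) for w in words]

def search_lc(records, query):
    """Case-insensitive search: compare the lowercased query with lc."""
    q = query.lower()
    return [r["word"] for r in records if r["lc"] == q]

matches = search_lc(records, "APPLE")
print(matches)  # ['Apple', 'apple']
```

    Searching on lc finds both Apple and apple, whereas a search on word would find only the form that was typed.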
  • lemma [ attribute ]

    Lemma is a positional attribute. It is the basic form of a word, typically the form found in dictionaries. A lemmatized corpus makes it possible to search for the basic form and include all forms of the word in the result, e.g. searching for the lemma go will find go, goes, went, going, gone. A wordlist of lemmas is a frequency list where all of go, went, gone, goes, going are counted together and listed as go.

    Lemma in Sketch Engine is case sensitive, so City and city are two different lemmas (City = the City of London; city = a common noun). The lemma of the first word of a sentence is always lowercase. Therefore, a search for the lemma city will also find City, but only if City appears at the beginning of a sentence.

    The concept of the lemma is not always clearly defined and may differ between languages (or even between two corpora in the same language). For example, in Sketch Engine, many, more, most are three different lemmas in English. In Czech, on the other hand, the corresponding forms, which are also irregular (mnoho, více, nejvíce), share the same lemma hodně. The situation is even more complex with agglutinating languages such as Turkish, Hungarian or Japanese, where it may not be easy to decide how many affixes should be removed to produce a lemma. The term stem is often used in place of lemma, but stem usually refers to the very core part of the word, and several lemmas may share the same stem.

    In Sketch Engine, all corpora in the same language are processed using the same tools and therefore have the same lemmatization. Rare exceptions exist if the corpus was acquired from an external source together with its original lemmatization. See also: lemma_lc, word form, lempos, list of attributes.
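    The idea of a lemma wordlist and a lemma search can be sketched in a few lines of Python. The (word form, lemma) pairs below are made-up sample data and lemma_search is a hypothetical helper; a real corpus would get its lemmas from a lemmatizer.

```python
from collections import Counter

# Hypothetical (word form, lemma) pairs, as a lemmatizer might produce them.
corpus = [
    ("go", "go"), ("went", "go"), ("gone", "go"), ("goes", "go"),
    ("going", "go"), ("went", "go"), ("City", "City"), ("city", "city"),
]

# Lemma wordlist: all word forms of a lemma are counted together.
lemma_freq = Counter(lemma for _, lemma in corpus)

def lemma_search(corpus, lemma):
    """Return the distinct word forms whose lemma matches exactly (case sensitive)."""
    return sorted({word for word, lem in corpus if lem == lemma})

print(lemma_freq["go"])            # 6
print(lemma_search(corpus, "go"))  # ['go', 'goes', 'going', 'gone', 'went']
```

    Because the comparison is case sensitive, City and city are counted as two separate lemmas in this sketch, matching the behaviour described above.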
  • lemma_lc [ attribute ]

    lemma_lc is a positional attribute. It is a lemma converted to lowercase, so apple and Apple are treated as the same lemma. It is used for case-insensitive searching and case-insensitive analysis. See also: lemma.
  • lempos [ attribute ]

    Lempos is a positional attribute, i.e. an attribute assigned to each token in the corpus. It is a combination of lemma and part of speech (pos), consisting of the lemma, a hyphen and a one-letter abbreviation of the part of speech, e.g. go-v, house-n. The part-of-speech abbreviations differ between corpora. Lempos is case sensitive: house-n is different from House-n. See also: lempos_lc, lemma, list of attributes.
  • lempos_lc [ attribute ]

    lempos_lc is a positional attribute. It is a lowercased version of lempos. All uppercase letters are converted to lowercase, thus House-n becomes identical to house-n. It is used for case-insensitive searching and analysis. See also: lempos, list of attributes.
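    The structure of lempos and lempos_lc can be shown with a short Python sketch. The helper names make_lempos and split_lempos are hypothetical, and the example ice-cream-n is an assumed value chosen only to show why splitting at the last hyphen is the safer choice; actual abbreviations depend on the corpus.

```python
def make_lempos(lemma, pos):
    """Join a lemma and a one-letter part-of-speech abbreviation with a hyphen."""
    return f"{lemma}-{pos}"

def split_lempos(lempos):
    """Split at the LAST hyphen so that hyphenated lemmas stay intact."""
    lemma, _, pos = lempos.rpartition("-")
    return lemma, pos

upper = make_lempos("House", "n")   # 'House-n'
lower = make_lempos("house", "n")   # 'house-n'

print(upper == lower)                   # False: lempos is case sensitive
print(upper.lower() == lower.lower())   # True: the lowercased forms coincide
print(split_lempos("ice-cream-n"))      # ('ice-cream', 'n')
```

    Lowercasing the whole string is enough to model lempos_lc here, assuming the part-of-speech abbreviation is itself lowercase.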
  • longtag [ attribute ]

    Longtag is a detailed part-of-speech tag which usually contains more information than tag. Some corpora have tags containing only basic part-of-speech information and, in addition, a longtag attribute carrying detailed grammatical information such as case, number, gender, etc. Longtags are available in the Estonian corpus etTenTen and the Turkish corpus trTenTen.
  • POS tag [ attribute ]

    A POS tag is the same as tag.
  • stem [ attribute ]

    A stem is the part of a word that remains after its affixes (suffixes, prefixes, etc.) are removed. Stems do not have to be valid word forms, e.g. the stem of the word form having is hav, whereas its lemma is have. Stems are used instead of lemmas, or in addition to lemmas, in languages whose morphology requires it. Examples are agglutinating languages such as Turkish, Hungarian or Japanese.
  • tag [ attribute ]

    (also called part-of-speech tag, POS tag or morphological tag) is a label assigned to each token in an annotated corpus to indicate the part of speech and grammatical category. The tool used to annotate a corpus is called a tagger. A collection of tags used in a corpus is called a tagset. The most frequently used tags in a corpus are listed on the corpus information page with a link to the complete tagset. Our blog post on POS tags explains how they work.
  • word form [ attribute ]

    This entry is for the positional attribute: word form, lemma, lowercase, tag… For the type of token, the opposite of nonword, see word. The word form (often shortened to word in the interface) is a positional attribute. It refers to one of the word forms that a lemma can take, e.g. the lemma go can take these word forms: go, went, gone, goes, going. A list of word forms is a list where each of go, went, gone, goes, going is listed separately and their frequencies are also calculated separately. A search using word forms will only find the word form(s) typed into the search form. It will not find the other word forms belonging to the same lemma. The word form is case sensitive: apple and Apple are two different word forms. Compare: word_lc (lowercase), lemma, lemma_lc (lowercase). See also: list of attributes, token.
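    As a contrast to lemma-based counting, the following Python sketch treats every word form separately. The token list is made-up sample data and wordform_search is a hypothetical helper, not part of any Sketch Engine interface.

```python
from collections import Counter

# Hypothetical token sequence from a tiny corpus.
tokens = ["go", "went", "Apple", "apple", "went", "going", "apple"]

# Word form list: each distinct form gets its own frequency.
wordform_freq = Counter(tokens)

def wordform_search(tokens, query):
    """Return the positions of tokens that match the query exactly (case sensitive)."""
    return [i for i, tok in enumerate(tokens) if tok == query]

print(wordform_freq["went"])              # 2
print(wordform_search(tokens, "apple"))   # [3, 6]
print(wordform_search(tokens, "go"))      # [0]
```

    The search for go finds only the token go itself and none of went or going, and Apple is not returned for the query apple.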