Words, tags, lemmas, lemposes, lowercase – what are they for?

When using Sketch Engine, every now and then the user comes across the word attribute and its values: words, tags, lemmas, lempos, lowercase and some others depending on the corpus and language. This blog post explains how these positional attributes, to use the correct terminology, work in Sketch Engine and how the user can benefit from them.

Attributes – versions of a corpus

As soon as some text is uploaded to Sketch Engine, it is divided into tokens, i.e. tokenized. A token is the smallest unit of a corpus. Each word or punctuation mark is a token. Hello is one token. Hello! is two tokens. The next step is to convert the original text into additional versions. Each version takes its name from the attribute into which the original corpus is converted.
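The idea of tokenization can be illustrated with a toy rule. This is only a minimal sketch: the function name tokenize and the regular expression are assumptions made for illustration, not the tokenizer Sketch Engine actually uses (for example, this rule would not split don’t into do and n’t).

```python
import re

def tokenize(text):
    """Toy tokenizer: a run of letters/digits is one token,
    and every single punctuation character is its own token."""
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("Hello"))   # ['Hello']
print(tokenize("Hello!"))  # ['Hello', '!']
```

The second call returns two tokens, matching the Hello! example above.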

word

Each token will immediately become part of the corpus version called word which is short for word form. Word represents each token exactly as it was written in the original sentence. It is not modified by Sketch Engine in any way. The first word in the sentence will keep its capital letter, contractions such as n’t in don’t will stay as n’t. This is what a sentence will look like when it is tokenized.

see also word form

word
The
Cook
Islands
were
n’t
named
after
a
cook
but
after
James
Cook
who
landed
on
the
islands
in
1773
to
explore
the
land
.

Vertical text

The sentence is presented in the format of a vertical text, i.e. one token per line. This is the standard format of storing corpora in Sketch Engine. This format allows adding more attributes (columns) to each token easily.
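Why the vertical format makes adding attributes easy can be seen in a small sketch. The helper names add_column and to_vertical are hypothetical, and the tab separator is an assumption about how columns are delimited; the second column here is simply a lowercased copy of each token.

```python
def add_column(rows, values):
    """Append one more attribute value to every token row."""
    return [row + (value,) for row, value in zip(rows, values)]

def to_vertical(rows):
    """One token per line, attribute values separated by tabs."""
    return "\n".join("\t".join(row) for row in rows)

words = ["The", "Cook", "Islands"]
rows = [(w,) for w in words]                          # one column: word
rows = add_column(rows, [w.lower() for w in words])   # second column
vertical = to_vertical(rows)
print(vertical)
```

Each new attribute is just one more value at the end of every line, so the existing columns never need to change.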

word (lowercase) or lc

Word (lowercase), sometimes displayed as lc, is the next version of the corpus, i.e. the next positional attribute. To generate this attribute, all tokens in the corpus are converted to lowercase, including proper nouns (London → london, Peter → peter, WiFi → wifi, WIFI → wifi). Using this attribute (column) for searching makes the search case insensitive. Searching for cook will find both Cook and cook. Searching for Cook will find nothing because the lc attribute contains no capital letters.

When generating frequency lists using this attribute, lowercase and upper case variants of the word will be treated as the same words. To get a separate frequency for WiFi, WIFI and wifi, use the word attribute.
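The effect of the lc attribute on searching and on frequency lists can be sketched with the example sentence used on this page. This is an illustration only: the search function is a hypothetical stand-in for an exact-match query on one attribute, not Sketch Engine code.

```python
from collections import Counter

tokens = ["The", "Cook", "Islands", "were", "n’t", "named", "after", "a",
          "cook", "but", "after", "James", "Cook", "who", "landed", "on",
          "the", "islands", "in", "1773", "to", "explore", "the", "land", "."]

# The lc attribute: every token converted to lowercase.
lc_column = [t.lower() for t in tokens]

def search(attribute_values, query):
    """Return the positions whose attribute value equals the query."""
    return [i for i, value in enumerate(attribute_values) if value == query]

hits_cook = search(lc_column, "cook")   # finds Cook, cook and Cook
hits_Cook = search(lc_column, "Cook")   # nothing: lc has no capitals

word_freq = Counter(tokens)      # frequency list on the word attribute
lc_freq = Counter(lc_column)     # frequency list on the lc attribute

print(hits_cook, hits_Cook)
print(word_freq["Cook"], word_freq["cook"], lc_freq["cook"])
print(len(word_freq), len(lc_freq))
```

On the word attribute Cook and cook are counted separately (2 and 1); on lc they merge into a single item with frequency 3, and the lc list has fewer distinct items.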

lc is added as an additional column to the vertical text. Logically, the lc attribute is only present if the script distinguishes between lowercase and upper case. There is no lc attribute for Chinese or for corpora in Indian scripts.

To make the search or analysis work with the lc or word (lowercase) attribute:

  • activate the A = a option in the input form
  • if not available, choose the word (lowercase), word_lc or lc option from the list of available attributes (the names can differ between corpora).

see also lc

Please study this vertical text and compare the frequency lists generated on the word and word (lowercase) attributes. The lc list will never be longer and is usually shorter: it contains a smaller variety of items because the distinction between upper case and lowercase is lost.

word lc
The the
Cook cook
Islands islands
were were
n’t n’t
named named
after after
a a
cook cook
but but
after after
James james
Cook cook
who who
landed landed
on on
the the
islands islands
in in
1773 1773
to to
explore explore
the the
land land
. .
Figure: frequency list on the word attribute
Figure: frequency list on the lc attribute

lemma

The lemma is the form of the word found in dictionaries, sometimes called the base form. Introducing lemmas makes it possible to treat different word forms of a word as the same word. This is especially useful with morphologically rich languages, i.e. languages where lemmas can have many different word forms (Spanish, French, Polish, Japanese, Turkish, Russian etc.).

The existence of the lemma makes it possible to type go and find go, goes, going, gone and went automatically. A wordlist generated on the lemma attribute will count the frequencies of go, goes, going, gone and went together and display them as one item: go. To find their individual frequencies, the lc attribute should be used.
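How a lemma groups word forms can be sketched as follows. The (word, lemma) pairs are a made-up mini-corpus typed in by hand, and find_by_lemma is a hypothetical helper; in a real corpus the lemma column is produced automatically by a lemmatizer.

```python
from collections import Counter

# Hypothetical mini-corpus: each token as a (word, lemma) pair.
pairs = [("go", "go"), ("goes", "go"), ("going", "go"),
         ("gone", "go"), ("went", "go"), ("went", "go")]

def find_by_lemma(pairs, lemma):
    """Return every word form whose lemma matches the query."""
    return [word for word, lem in pairs if lem == lemma]

forms = find_by_lemma(pairs, "go")
lemma_freq = Counter(lem for _, lem in pairs)    # one item: go
word_freq = Counter(word for word, _ in pairs)   # individual forms

print(forms)
print(lemma_freq["go"], word_freq["went"])
```

The lemma list shows a single item go with the combined frequency of all forms, while the list built on the word forms keeps went, goes and the others apart.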

The lemma preserves the original capitalization but, typically, the first word of a sentence will be lowercased.

In most languages (German is one of the exceptions), when a capitalized word is found in the middle of a sentence, the lemmatizer treats it as unusual usage, possibly a brand name or proper noun, and assigns a lemma which is identical to the word form. Compare islands → island but Islands → Islands.

see also lemma

Compare these frequency lists generated on the word and lemma attributes. The lemma itself does not differentiate between parts of speech, therefore landed and land are counted as the same lemma despite being a verb and a noun. The Sketch Engine interface, however, features functionality to take the part of speech into account if needed.

word lc lemma
The the the
Cook cook Cook
Islands islands Islands
were were be
n’t n’t not
named named name
after after after
a a a
cook cook cook
but but but
after after after
James james James
Cook cook Cook
who who who
landed landed land
on on on
the the the
islands islands island
in in in
1773 1773 [number]
to to to
explore explore explore
the the the
land land land
. . .
Figure: frequency list on the word attribute
Figure: frequency list on the lemma attribute

lemma (lowercase)

Lemma (lowercase), sometimes shown as lemma_lc, is used to ignore differences in lemma capitalisation. This is analogous to the difference between word and lc (see above). Searching a corpus with the lemma (lowercase) attribute allows the user to type cook and find cook, cooks and Cook.

To make the search or analysis work with the lemma (lowercase) attribute:

  • activate the A = a option in the input form
  • if not available, choose the lemma (lowercase) or lemma_lc option from the list of available attributes (the names can differ between corpora).

see also lemma_lc

word lc lemma lemma_lc
The the the the
Cook cook Cook cook
Islands islands Islands islands
were were be be
n’t n’t not not
named named name name
after after after after
a a a a
cook cook cook cook
but but but but
after after after after
James james James james
Cook cook Cook cook
who who who who
landed landed land land
on on on on
the the the the
islands islands island island
in in in in
1773 1773 [number] [number]
to to to to
explore explore explore explore
the the the the
land land land land
. . . .
Figure: frequency list on the lemma attribute
Figure: frequency list on the lemma (lowercase) attribute

tag or POS tag or part-of-speech tag

The tag attribute contains POS tags with information about the part of speech of each token and usually also other grammatical or morphological information such as number, gender, tense etc. Tags are assigned automatically by a tagger.
Using the tag for searching makes it possible to find all words with the same part of speech. Combining the tag with other attributes makes it possible to only find words when used (or not used) as a specific part of speech.
A frequency list of tags will provide information about how frequent each part of speech is in the corpus.
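Combining the tag with another attribute can be sketched using a few rows from the example sentence. This is an illustration in plain Python, not a Sketch Engine query; the helper find and the prefix test on the tag are assumptions chosen to fit the tags shown on this page.

```python
from collections import Counter

# (word, lemma, tag) rows taken from the example sentence.
rows = [("Cook", "Cook", "NP"), ("landed", "land", "VVD"),
        ("on", "on", "IN"), ("the", "the", "DT"),
        ("islands", "island", "NNS"), ("to", "to", "TO"),
        ("explore", "explore", "VV"), ("the", "the", "DT"),
        ("land", "land", "NN")]

def find(rows, lemma, tag_prefix):
    """Words with the given lemma whose tag starts with tag_prefix."""
    return [w for w, lem, tag in rows
            if lem == lemma and tag.startswith(tag_prefix)]

land_as_verb = find(rows, "land", "V")   # only the verb use
land_as_noun = find(rows, "land", "N")   # only the noun use
tag_freq = Counter(tag for _, _, tag in rows)

print(land_as_verb, land_as_noun)
print(tag_freq.most_common(1))
```

The lemma land alone matches both landed and land; adding the tag condition separates the verb from the noun, and the tag counts show which tags are most frequent.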

see also POS tag

word lc lemma lemma_lc tag
The the the the DT
Cook cook Cook cook NP
Islands islands Islands islands NP
were were be be VBD
n’t n’t not not RB
named named name name VVN
after after after after IN
a a a a DT
cook cook cook cook NN
but but but but CC
after after after after IN
James james James james NP
Cook cook Cook cook NP
who who who who WP
landed landed land land VVD
on on on on IN
the the the the DT
islands islands island island NNS
in in in in IN
1773 1773 [number] [number] CD
to to to to TO
explore explore explore explore VV
the the the the DT
land land land land NN
. . . . SENT

Tagset

The complete list of tags used in a corpus is called a tagset and can be accessed via the corpus info page.

lempos and lempos_lc

The lempos attribute was introduced mainly to make the computation of the word sketch and thesaurus possible. Lempos stands for lemma + POS. It is a combination of the lemma and a one-letter abbreviation of the part of speech, joined by a hyphen (for example cook-n or land-v). Parts of speech not supported by the word sketch all use the same suffix -x.

The lempos_lc or lempos (lowercase) is the lowercase version of lempos.
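The construction of lempos can be sketched as lemma plus a suffix derived from the tag. The mapping below is an assumption inferred only from the example table on this page; the real mapping is defined for each corpus and tagset, and the function names pos_suffix, lempos and lempos_lc are hypothetical.

```python
def pos_suffix(tag):
    """Map a POS tag to a one-letter suffix (mapping inferred from
    the example table on this page, not an official definition)."""
    if tag.startswith("N"):
        return "n"   # nouns and proper nouns
    if tag.startswith("V"):
        return "v"   # verbs
    # Remaining tags from the example; anything else falls back to x.
    return {"RB": "a", "IN": "i", "CC": "c", "CD": "m"}.get(tag, "x")

def lempos(lemma, tag):
    """Join the lemma and the POS suffix with a hyphen."""
    return f"{lemma}-{pos_suffix(tag)}"

def lempos_lc(lemma, tag):
    """Lowercase version of lempos."""
    return lempos(lemma, tag).lower()

print(lempos("Cook", "NP"), lempos_lc("Cook", "NP"))
print(lempos("land", "VVD"), lempos("land", "NN"))
print(lempos("the", "DT"))
```

Because the suffix is part of the value, land-v and land-n are distinct items, which is what lets the word sketch treat the verb and the noun separately.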

see also lempos and lempos_lc

word lc lemma lemma_lc tag lempos lempos_lc
The the the the DT the-x the-x
Cook cook Cook cook NP Cook-n cook-n
Islands islands Islands islands NP Islands-n islands-n
were were be be VBD be-v be-v
n’t n’t not not RB not-a not-a
named named name name VVN name-v name-v
after after after after IN after-i after-i
a a a a DT a-x a-x
cook cook cook cook NN cook-n cook-n
but but but but CC but-c but-c
after after after after IN after-i after-i
James james James james NP James-n james-n
Cook cook Cook cook NP Cook-n cook-n
who who who who WP who-x who-x
landed landed land land VVD land-v land-v
on on on on IN on-i on-i
the the the the DT the-x the-x
islands islands island island NNS island-n island-n
in in in in IN in-i in-i
1773 1773 [number] [number] CD [number]-m [number]-m
to to to to TO to-x to-x
explore explore explore explore VV explore-v explore-v
the the the the DT the-x the-x
land land land land NN land-n land-n
. . . . SENT .-x .-x

Vertical file download

A user corpus can be downloaded as a plain text file or as vertical text. The latter option includes only three attributes (word, tag and lempos) together with the structures and their attributes (metadata).

Vertical text cannot be downloaded with all the columns shown on this page. They are included here for clarity only.
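Reading a downloaded vertical file might look like the sketch below. Several things here are assumptions for illustration: the sample text is made up, the columns are taken to be tab-separated in the order word, tag, lempos, and any line starting with < is treated as a structure line. The function name parse_vertical is hypothetical.

```python
# Made-up sample in the assumed layout: word, tag, lempos per line,
# with structure lines such as <s> marking sentence boundaries.
sample = "<s>\nThe\tDT\tthe-x\nland\tNN\tland-n\n.\tSENT\t.-x\n</s>"

def parse_vertical(text):
    """Return the token lines as dictionaries and skip structure lines.
    Caveat: a token line that itself starts with '<' would be skipped."""
    tokens = []
    for line in text.splitlines():
        if not line or line.startswith("<"):
            continue  # structure line or empty line
        word, tag, lempos = line.split("\t")
        tokens.append({"word": word, "tag": tag, "lempos": lempos})
    return tokens

tokens = parse_vertical(sample)
print(len(tokens))
print(tokens[1])
```

Attributes missing from the download, such as lc, can be derived afterwards, for example by lowercasing the word column.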

How to display attributes

Concordance

Attributes can be viewed easily in the concordance. The concordance can be generated:

  • from scratch using a concordance search,
  • by jumping to the concordance via the local menu next to each result in other tools.

In the concordance, the view options offer the complete selection of attributes.


Tip

Sketch Engine remembers your view settings for each corpus. Only keep the attributes displayed if you really need to see them. Otherwise hide them to keep the screen neat and tidy and easy to work with. The display of many attributes and many concordance lines on one screen can slow your browser down.

Wordlist and n-grams

To include other attributes in the wordlist, use the ADVANCED tab to select the required attribute. To include more than one attribute, use the Display as option.

Word Sketch and thesaurus

The default attribute is set in the configuration file. Changing it may require writing a new sketch grammar.

Keywords & terms

Keywords – use the advanced tab to change the attribute for keywords.

Terms – the attribute is set in the term grammar. Changing the attribute may require writing a new term grammar. The reference corpus must be processed with the same term grammar as the focus corpus.
