Words, tags, lemmas, lemposes, lowercase – what are they for?
When using Sketch Engine, users regularly come across the term attribute and its values: word, tag, lemma, lempos, lowercase and some others, depending on the corpus and language. This blog post explains how these positional attributes, to use the correct terminology, work in Sketch Engine and how the user can benefit from them.
Attributes – versions of a corpus
As soon as some text is uploaded to Sketch Engine, it is divided into tokens, i.e. tokenized. A token is the smallest unit of a corpus. Each word or punctuation mark is a token. Hello is one token. Hello! is two tokens. The next step is to convert the original text into additional versions. Each version takes its name from the attribute into which the original corpus is converted.
Each token immediately becomes part of the corpus version called word, which is short for word form. Word represents each token exactly as it was written in the original sentence. It is not modified by Sketch Engine in any way. The first word in the sentence will keep its capital letter, and contractions such as n’t in don’t will stay as n’t. This is what a sentence will look like when it is tokenized.
see also word form
The sentence is presented in the format of a vertical text, i.e. one token per line. This is the standard format for storing corpora in Sketch Engine. This format makes it easy to add more attributes (columns) to each token.
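The tokenization and vertical format described above can be sketched roughly as follows. This is only a toy illustration: the regular expression and the function names tokenize and to_vertical are assumptions made for this example, and the tokenizers actually used by Sketch Engine are language-specific and more sophisticated.

```python
import re

# Hypothetical toy tokenizer (not Sketch Engine's own): it splits off
# punctuation marks and the English contraction n't (straight or curly
# apostrophe), and leaves every token otherwise unmodified.
TOKEN_RE = re.compile(r"n['’]t\b|\w+(?=n['’]t\b)|\w+|[^\w\s]")


def tokenize(text):
    """Return the list of tokens found in text, in their original form."""
    return TOKEN_RE.findall(text)


def to_vertical(tokens):
    """Render tokens as vertical text: one token per line."""
    return "\n".join(tokens)


print(to_vertical(tokenize("Hello! I don't cook.")))
```

Each printed line corresponds to one row of the vertical text; further attributes would be added as extra columns on the same row.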
word (lowercase) or lc
Word (lowercase), sometimes displayed as lc, is the next version of the corpus, i.e. the next positional attribute. To generate this attribute, all tokens in the corpus are converted to lowercase, including proper nouns (London⇢london, Peter⇢peter, WiFi⇢wifi, WIFI⇢wifi). Using this attribute (column) for searching makes the search case insensitive, provided the search term is typed in lowercase. Searching for cook will find both Cook and cook. Searching for Cook will find nothing because the lc attribute contains no capital letters.
When generating frequency lists using this attribute, lowercase and upper case variants of the word will be treated as the same words. To get a separate frequency for WiFi, WIFI and wifi, use the word attribute.
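As a rough illustration of the difference between the two attributes, the sketch below builds frequency lists and runs a search over a small invented token list. The data and the helper name search_lc are assumptions for this example only and do not reflect how Sketch Engine is implemented.

```python
from collections import Counter

# Invented sample of the word attribute (tokens exactly as written).
words = ["WiFi", "WIFI", "wifi", "Cook", "cook", "cook"]

# The lc attribute: every token converted to lowercase.
lc = [w.lower() for w in words]

# Frequency lists on each attribute.
word_freq = Counter(words)   # WiFi, WIFI and wifi counted separately
lc_freq = Counter(lc)        # all three counted together as wifi


def search_lc(query):
    """Return positions whose lc value equals the query exactly."""
    return [i for i, value in enumerate(lc) if value == query]


print(word_freq)
print(lc_freq)
print(search_lc("cook"))  # positions of Cook and cook
print(search_lc("Cook"))  # empty: lc never contains capitals
```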
lc is added as an additional column to the vertical text. Logically, the lc attribute is only present if the script distinguishes between lowercase and upper case. There is therefore no lc attribute for Chinese or for corpora in Indian scripts.
see also lc
Please study this vertical text and compare the frequency lists generated on the word and word (lowercase) attributes. The lc list will always be shorter or, at most, the same length. It contains a smaller variety of items because the distinction between upper case and lowercase is lost.
The lemma is the form of the word found in dictionaries, sometimes called the base form. Introducing lemmas makes it possible to treat different word forms of a word as the same word. This is especially useful with morphologically rich languages, i.e. languages where lemmas can have many different word forms (Spanish, French, Polish, Japanese, Turkish, Russian etc.).
The existence of the lemma makes it possible to type go and find go, goes, going, gone and went automatically. A wordlist generated on the lemma attribute will count the frequencies of go, goes, going, gone and went together and display them as one item: go. To find their individual frequencies, the lc attribute should be used.
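The effect of counting on the lemma attribute can be sketched as follows. The (word, lemma) pairs are assigned by hand purely for illustration; in a real corpus the lemma column is produced by a lemmatizer.

```python
from collections import Counter

# Invented (word, lemma) rows; the lemmas are hand-assigned assumptions.
rows = [
    ("go", "go"),
    ("goes", "go"),
    ("going", "go"),
    ("gone", "go"),
    ("went", "go"),
    ("went", "go"),
    ("islands", "island"),
]

# Wordlist on the lemma attribute: all forms of go are one item.
lemma_freq = Counter(lemma for _, lemma in rows)

# Wordlist on the lc attribute: each word form keeps its own count.
lc_freq = Counter(word.lower() for word, _ in rows)

print(lemma_freq)
print(lc_freq)
```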
The lemma preserves the original capitalization but, typically, the first word of a sentence will be lowercased.
In most languages (German is one of the exceptions), when a capitalized word is found in the middle of the sentence, the lemmatizer identifies it as unusual usage, possibly a brand name or proper noun, and will assign a lemma which is identical to the word form. Compare islands ⇢ island but Islands ⇢ Islands.
see also lemma
Compare these frequency lists generated on the word and lemma attributes. The lemma itself does not differentiate between parts of speech, therefore landed and land are counted as the same lemma despite being a verb and a noun. The Sketch Engine interface, however, features functionality to take the part of speech into account if needed.
Lemma (lowercase), sometimes shown as lemma_lc, is used to ignore differences in lemma capitalisation. This is analogous to the difference between word and lc (see above). Searching a corpus with the lemma (lowercase) attribute allows the user to type cook and find cook, cooks and Cook.
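A minimal sketch of the difference between searching on lemma and on lemma (lowercase) is shown below. The rows and the search helper are hypothetical and assume that a mid-sentence Cook was lemmatized as Cook, as described earlier.

```python
# Invented (word, lemma) rows; the capitalized Cook keeps the lemma Cook.
rows = [
    ("cook", "cook"),
    ("cooks", "cook"),
    ("Cook", "Cook"),
]


def search(rows, query, lowercase=False):
    """Return word forms whose lemma (or lowercased lemma) equals query."""
    hits = []
    for word, lemma in rows:
        value = lemma.lower() if lowercase else lemma
        if value == query:
            hits.append(word)
    return hits


print(search(rows, "cook"))                  # lemma attribute
print(search(rows, "cook", lowercase=True))  # lemma_lc attribute
```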
see also lemma_lc
tag or POS tag or part-of-speech tag
The tag attribute contains POS tags with information about the part of speech of each token and usually also other grammatical or morphological information such as number, gender, tense etc. Tags are assigned automatically by a tagger.
Using the tag for searching makes it possible to find all words with the same part of speech. Combining the tag with other attributes makes it possible to only find words when used (or not used) as a specific part of speech.
A frequency list of tags will provide information about how frequent each part of speech is in the corpus.
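Combining the tag with another attribute, and counting tags, might look roughly like this. The tags are illustrative Penn-Treebank-style labels chosen for this example, not necessarily the tagset of any particular corpus, and the find helper is a hypothetical name.

```python
import re
from collections import Counter

# Invented (word, lemma, tag) rows with illustrative tags.
rows = [
    ("The", "the", "DT"),
    ("plane", "plane", "NN"),
    ("landed", "land", "VVD"),
    ("on", "on", "IN"),
    ("dry", "dry", "JJ"),
    ("land", "land", "NN"),
]


def find(rows, lemma=None, tag_pattern=None):
    """Return word forms matching an optional lemma and tag regex."""
    hits = []
    for word, row_lemma, tag in rows:
        if lemma is not None and row_lemma != lemma:
            continue
        if tag_pattern is not None and not re.fullmatch(tag_pattern, tag):
            continue
        hits.append(word)
    return hits


# Frequency list of tags: how often each part of speech occurs.
tag_freq = Counter(tag for _, _, tag in rows)

print(find(rows, tag_pattern="N.*"))                # all nouns
print(find(rows, lemma="land", tag_pattern="V.*"))  # land only as a verb
print(tag_freq)
```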
see also POS tag
The complete list of tags used in a corpus is called a tagset and can be accessed via the corpus info page.
lempos and lempos_lc
The lempos attribute was introduced mainly to make the computation of the word sketch and thesaurus possible. Lempos stands for lemma + POS. It is a combination of the lemma and a one-letter abbreviation of the part of speech, joined by a hyphen. Parts of speech not supported by the word sketch all use the same suffix -x.
The lempos_lc or lempos (lowercase) is the lowercase version of lempos.
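The construction of lempos and lempos_lc can be sketched as follows. The mapping from tags to suffixes is an assumption made for this example (the actual suffixes depend on the corpus and its tagset); only the fallback suffix -x is taken from the description above.

```python
# Hypothetical mapping from the first letter of a tag to a lempos suffix.
# Real corpora define their own mapping; anything unlisted falls back to x.
POS_SUFFIX = {"N": "n", "V": "v", "J": "j"}


def lempos(lemma, tag):
    """Combine a lemma with a short part-of-speech suffix."""
    suffix = POS_SUFFIX.get(tag[:1], "x")
    return f"{lemma}-{suffix}"


def lempos_lc(lemma, tag):
    """Lowercase version of lempos."""
    return lempos(lemma, tag).lower()


print(lempos("land", "NN"))      # the noun
print(lempos("land", "VVD"))     # the verb
print(lempos("the", "DT"))       # unsupported part of speech
print(lempos_lc("London", "NP"))
```

Because the part of speech is part of the value itself, the noun land and the verb land become two distinct items, which the plain lemma attribute cannot express.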
How to display attributes
Attributes can be viewed easily in the concordance. The concordance can be generated:
- from scratch using a concordance search,
- by jumping to the concordance via the local menu next to each result in other tools.
In the concordance, the view options offer the complete selection of attributes.