A lemma is the basic form of a word, typically the form found in dictionaries. A lemmatized corpus allows searching for the basic form and including all forms of the word in the result, e.g. searching for the lemma go will find go, went, gone, goes, going.
Lemmas in Sketch Engine are case sensitive: City and city are two different lemmas (City = the City of London; city = a common noun). The lemma of the first word of a sentence is always lowercase. Therefore, a search for the lemma city will also find City, but only if City appears at the beginning of a sentence.
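The case rule above can be illustrated with a minimal Python sketch. It is not Sketch Engine code: the toy sentences, the hand-assigned lemmas and the helper names sentence_lemmas and search_lemma are all assumptions made for this example.

```python
# Toy corpus: each token is a (word form, lemma) pair.
# The lemmas are hand-assigned for illustration, not real tagger output.
SENTENCES = [
    [("City", "City"), ("workers", "worker"), ("left", "leave")],
    [("She", "she"), ("works", "work"), ("in", "in"), ("the", "the"), ("City", "City")],
    [("The", "the"), ("city", "city"), ("grew", "grow")],
]


def sentence_lemmas(sentence):
    """Return (word, lemma) pairs with the first token's lemma lowercased."""
    return [
        (word, lemma.lower() if index == 0 else lemma)
        for index, (word, lemma) in enumerate(sentence)
    ]


def search_lemma(sentences, query):
    """Case-sensitive lemma search.

    Returns a list of (sentence index, token index, word form) hits.
    """
    hits = []
    for s_idx, sentence in enumerate(sentences):
        for t_idx, (word, lemma) in enumerate(sentence_lemmas(sentence)):
            if lemma == query:
                hits.append((s_idx, t_idx, word))
    return hits


lowercase_hits = search_lemma(SENTENCES, "city")
uppercase_hits = search_lemma(SENTENCES, "City")
```

In this sketch, the search for city matches the sentence-initial City in the first sentence and the common noun in the third, while the search for City matches only the mid-sentence occurrence in the second.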
A wordlist of lemmas is a frequency list where all of go, went, gone, goes, going are counted together and listed as go.
A lemma search will find all of go, went, gone, goes, going. Capitalized word forms will only be included if found at the beginning of the sentence.
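As a rough illustration of the two ideas above, the following Python sketch builds a lemma wordlist and runs a lemma search over a toy token list. The LEMMA_OF lookup table, the token list and the function names are hypothetical and written only for this example; a real corpus would take its lemmas from a lemmatizer.

```python
from collections import Counter

# Hypothetical form-to-lemma lookup, hand-written for this example.
LEMMA_OF = {
    "go": "go",
    "went": "go",
    "gone": "go",
    "goes": "go",
    "going": "go",
}

# Toy token stream; forms missing from LEMMA_OF are treated as their own lemma.
TOKENS = [
    "she", "went", "home", "he", "goes", "home",
    "they", "have", "gone", "we", "go", "going",
]


def lemma_wordlist(tokens):
    """Count tokens by lemma, so all forms of one word are counted together."""
    return Counter(LEMMA_OF.get(token, token) for token in tokens)


def lemma_search(tokens, lemma):
    """Return every word form in the token stream whose lemma equals `lemma`."""
    return [token for token in tokens if LEMMA_OF.get(token, token) == lemma]


wordlist = lemma_wordlist(TOKENS)
go_forms = lemma_search(TOKENS, "go")
```

Here the wordlist has a single entry go with a frequency of 5, and the lemma search returns the five matching forms in corpus order.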
The concept of the lemma is not always clearly defined and may differ between languages (or even between two corpora in the same language). For example, in Sketch Engine, many, more, most are three different lemmas in English. In Czech, on the other hand, the forms of the corresponding adjective, which is also irregular (mnoho, více, nejvíce), share the same lemma hodně.
The situation is even more complex with agglutinating languages such as Turkish, Hungarian or Japanese, where it may not be easy to decide how many affixes should be removed to produce a lemma. The term stem is often used in place of lemma, but a stem usually refers to the very core part of the word, and several lemmas may share the same stem.
In Sketch Engine, all corpora in the same language are processed using the same tools and therefore have the same lemmatization. Rare exceptions exist where a corpus was acquired from an external source together with its original lemmatization.
See also lemma-lc or compare with word form.