Case senstive & insensitive corpus analysis

Using case sensitive and case insensitive searches with corpora

This blog post explains how to analyse corpora and take into account or ignore the difference between lowercase and uppercase. In other words, how to use Sketch Engine to:

type wifi and find wifi, WIFI, WiFi and Wifi

type WiFi and only find WiFi but not the other variants

A short introduction to the lowercase attribute is required to fully understand how this can be achieved.

Lowercase

Lowercase is the key concept for case sensitive and case insensitive searches and analysis. When data are uploaded to Sketch Engine to build a corpus, they are automatically converted into several versions. To use the exact terminology, each token is assigned with several positional attributes. Attributes can be understood as corpus versions. Each column in the following vertical text represents a positional attribute (a version of the corpus).

The table shows some of the attributes (columns) into which this sentence would be converted:

The Cook Islands weren’t named after a cook but after James Cook who landed on the islands in 1773 to explore the land.

word	lc	lemma	lemma_lc	tag	lempos	lempos_lc
The	the	the	the	DT	the-x	the-x
Cook	cook	Cook	cook	NP	Cook-n	cook-n
Islands	islands	Islands	islands	NP	Islands-n	islands-n
were	were	be	be	VBD	be-v	be-v
n’t	n’t	not	not	RB	not-a	not-a
named	named	name	name	VVN	name-v	name-v
after	after	after	after	IN	after-i	after-i
a	a	a	a	DT	a-x	a-x
cook	cook	cook	cook	NN	cook-n	cook-n
but	but	but	but	CC	but-c	but-c
after	after	after	after	IN	after-i	after-i
James	james	James	james	NP	James-n	james-n
Cook	cook	Cook	cook	NP	Cook-n	cook-n
who	who	who	who	WP	who-x	who-x
landed	landed	land	land	VVD	land-v	land-v
on	on	on	on	IN	on-i	on-i
the	the	the	the	DT	the-x	the-x
islands	islands	island	island	NNS	island-n	island-n
in	in	in	in	IN	in-i	in-i
1773	1773	[number]	[number]	CD	[number]-m	[number]-m
to	to	to	to	TO	to-x	to-x
explore	explore	explore	explore	VV	explore-v	explore-v
the	the	the	the	DT	the-x	the-x
land	land	land	land	NN	land-n	land-n
.	.	.	.	SENT	.-x	.-x

The first attribute (column), called word, represents the text in its original form. No transformation is applied. The second attribute (column), called lc, lowercase or word (lowercase), is the same as word but converted into lowercase. All uppercase letters including ones in proper nouns and acronyms are lowercased (WiFi⇢wifi, WIFI⇢wifi, Paris⇢paris, Hugo⇢hugo, UNESCO⇢unesco). Similarly, lemma_lc and lempos_lc are the lowercased versions of the respective attributes. This blog post helps you understand all the different positional attributes.

The point of the lowercased attributes is to allow case insensitive searches and analysis when uppercase and lowercase variants of a token should be treated as the same thing.

How to switch to case insensitive

There are 2 ways to switch a tool in Sketch Engine into the case insensitive mode.

Option 1

Many tools have a case sensitivity switch, often found on the ADVANCED tab, not the SIMPLE tab.

	when inactive, the search or analysis is case sensitive, it will use the non-lowercased attributes (word, lemma, lempos) Searching for apple will finds apple. Searching for Apple, finds Apple. A frequency list will contain a separate frequencies for Apple and apple.
	when active, the search or analysis is case insensitive, it will use the lowercased attributes (lc, lemma_lc, lempos_lc) Searching for apple will find both Apple and Apple. A frequency list will count the frequencies of apple and Apple together and will display one number next to apple.

Option 2

Some tools do not have the switch but the user can select the required attribute directly.

Typing words

When lowercase is selected, the input form will automatically adjust the input to match the setting. All of these options:

WIFI
WiFi
Wifi
wifi

will be lowercased first and will produce the the same as typing wifi.

Tools in detail

Certain tools and operation have a predefined attribute to work with and the user cannot change it. This is how individual tools behave with regard to case sensitivity:

Word sketch

The word sketch always uses a predefined attribute, typically the lempos. The attribute is defined in the sketch grammar. The user cannot change the attribute on the fly. Word sketches are precalculated during compilation and changing the attribute would require recalculation. For user corpora, the user can write their own sketch grammar that uses a different attribute.

With lempos, apple will produce different collocations from Apple. Combined collocations for Apple and apple cannot be displayed.

With lempos_lc (only possible if the user writes their own sketch grammar based on this attribute), apple produces combined collocations for both apple and Apple. Typing Apple will not produce any results because lempos_lc does not contain any lemmas starting with an uppercase letter.

about word sktech

Word sketch difference

The information for the word sketch above applies to word sketch difference too.

about word sktech difference

Thesaurus

The thesaurus is based on comparing word sketches and therefore always uses the same attribute as the word sketch. To change the attribute for the thesaurus, the attribute for word sketch should be changed in the sketch grammar.

With the attribute set to lempos, apple and Apple will produce different lists of synonyms.

about thesaurus

Concordance and Parallel concordance

Simple search searches simultaneously in several attributes, typically word, lowercase and lemma. For user corpora, this can be in the corpus configuration file.

Other searches use the switch to activate the use of lowercased attributes.

CQL search – the attribute is set individually for each token.

Concordance result screen

The tools for working with the concordance result, located in the toolbar above the concordance lines, contain the BASIC and ADVANCED tabs. The latter contains either the switch or the attribute selector for switching between case sensitive and case insensitive.

about concordance

about parallel concordance

Wordlist

On the ADVANCED TAB, use or select the required attribute. Additional options such as starting with/contianing/ending with must match the selected attribute. This is how the combinations of settings affects the result:

input of: starting with containing ending with from this list	attribute		result	note
apple	word	☐	apple	will not find Apple
Apple	word	☐	Apple	will not find apple
apple	word	?	apple	results include both Apple and apple but are displayed as apple
Apple	word	?	apple	as above; the interface will lowercase the input

Use Display as to display a different result. For example, these criteria:

apple

word

apple

results include both Apple and apple but are displayed as apple

normally count Apple and apple together and display it as apple. Set Display as: to word to display a separate result for Apple and for apple.

about wordlist