Using case sensitive and case insensitive searches with corpora

This blog post explains how to analyse corpora and take into account or ignore the difference between lowercase and uppercase. In other words, how to use Sketch Engine to:

type wifi and find wifi, WIFI, WiFi and Wifi

OR

type WiFi and only find WiFi but not the other variants

A short introduction to the lowercase attribute is required to fully understand how this can be achieved.

Lowercase

Lowercase is the key concept for case sensitive and case insensitive searches and analysis. When data are uploaded to Sketch Engine to build a corpus, they are automatically converted into several versions. To use the exact terminology, each token is assigned several positional attributes. Attributes can be understood as corpus versions. Each column in the following vertical text represents a positional attribute (a version of the corpus).

The table shows some of the attributes (columns) into which this sentence would be converted:

The Cook Islands weren’t named after a cook but after James Cook who landed on the islands in 1773 to explore the land.

word lc lemma lemma_lc tag lempos lempos_lc
The the the the DT the-x the-x
Cook cook Cook cook NP Cook-n cook-n
Islands islands Islands islands NP Islands-n islands-n
were were be be VBD be-v be-v
n’t n’t not not RB not-a not-a
named named name name VVN name-v name-v
after after after after IN after-i after-i
a a a a DT a-x a-x
cook cook cook cook NN cook-n cook-n
but but but but CC but-c but-c
after after after after IN after-i after-i
James james James james NP James-n james-n
Cook cook Cook cook NP Cook-n cook-n
who who who who WP who-x who-x
landed landed land land VVD land-v land-v
on on on on IN on-i on-i
the the the the DT the-x the-x
islands islands island island NNS island-n island-n
in in in in IN in-i in-i
1773 1773 [number] [number] CD [number]-m [number]-m
to to to to TO to-x to-x
explore explore explore explore VV explore-v explore-v
the the the the DT the-x the-x
land land land land NN land-n land-n
. . . . SENT .-x .-x

The first attribute (column), called word, represents the text in its original form. No transformation is applied. The second attribute (column), called lc, lowercase or word (lowercase), is the same as word but converted into lowercase. All uppercase letters, including those in proper nouns and acronyms, are lowercased (WiFi → wifi, WIFI → wifi, Paris → paris, Hugo → hugo, UNESCO → unesco). Similarly, lemma_lc and lempos_lc are the lowercased versions of the respective attributes. A separate blog post, listed under See also, explains all the different positional attributes.

The point of the lowercased attributes is to allow case insensitive searches and analysis when uppercase and lowercase variants of a token should be treated as the same thing.
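
As a rough illustration of how the lowercased attributes relate to the original ones, the following Python sketch derives lc, lemma_lc and lempos_lc for a few tokens from the table above. The function name add_lowercase_attributes and the dictionary layout are assumptions made for this example only; Sketch Engine computes these attributes itself when the corpus is built.

```python
def add_lowercase_attributes(token):
    """Return a copy of the token with a lowercased variant of each attribute."""
    enriched = dict(token)
    enriched["lc"] = token["word"].lower()
    enriched["lemma_lc"] = token["lemma"].lower()
    enriched["lempos_lc"] = token["lempos"].lower()
    return enriched


# A few tokens from the example sentence (hypothetical in-memory layout).
tokens = [
    {"word": "Cook", "lemma": "Cook", "lempos": "Cook-n"},
    {"word": "cook", "lemma": "cook", "lempos": "cook-n"},
    {"word": "James", "lemma": "James", "lempos": "James-n"},
]

for row in map(add_lowercase_attributes, tokens):
    print(row["word"], row["lc"], row["lemma_lc"], row["lempos_lc"])
```

After this step, the proper noun Cook and the common noun cook share the same lc, lemma_lc and lempos_lc values, which is what makes case insensitive matching possible.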

How to switch to case insensitive

There are 2 ways to switch a tool in Sketch Engine into the case insensitive mode.

Option 1

Many tools have a case sensitivity switch, often found on the ADVANCED tab, not the SIMPLE tab.

When the switch is inactive, the search or analysis is case sensitive and uses the non-lowercased attributes (word, lemma, lempos).

Searching for apple finds only apple. Searching for Apple finds only Apple.
A frequency list will contain separate frequencies for Apple and apple.

When the switch is active, the search or analysis is case insensitive and uses the lowercased attributes (lc, lemma_lc, lempos_lc).

Searching for apple will find both apple and Apple.
A frequency list will count the frequencies of apple and Apple together and will display one number next to apple.
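
The effect of the switch on searching can be sketched in a few lines of Python. This is only a simplified model under stated assumptions: the search function, its case_sensitive parameter and the token layout are made up for the example and are not part of Sketch Engine.

```python
def search(tokens, query, case_sensitive=True):
    """Return the original word forms that match the query."""
    if case_sensitive:
        # Switch inactive: compare the query with the word attribute as typed.
        return [t["word"] for t in tokens if t["word"] == query]
    # Switch active: lowercase the query and compare it with the lc attribute.
    return [t["word"] for t in tokens if t["lc"] == query.lower()]


words = ["Apple", "apple", "APPLE", "pear"]
tokens = [{"word": w, "lc": w.lower()} for w in words]

print(search(tokens, "apple", case_sensitive=True))   # ['apple']
print(search(tokens, "Apple", case_sensitive=True))   # ['Apple']
print(search(tokens, "Apple", case_sensitive=False))  # ['Apple', 'apple', 'APPLE']
```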

Option 2

Some tools do not have the switch but the user can select the required attribute directly.

Screenshot: the Frequency attribute selector

Selecting the lowercased attributes will perform the statistics in a case insensitive way. This means that the upper case and lower case versions of the same token will be counted together.
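
To make the difference concrete, here is a minimal Python sketch that counts the same tokens once by the word attribute and once by the lowercased attribute. The token list is invented for the example and the counting is a simplification of what a frequency tool does.

```python
from collections import Counter

# Invented sample of word forms as they might appear in a corpus.
words = ["Apple", "apple", "apple", "Apple", "APPLE", "pear"]

# Case sensitive: every spelling variant gets its own frequency.
by_word = Counter(words)

# Case insensitive: variants are counted together under the lowercased form.
by_lc = Counter(w.lower() for w in words)

print(by_word["Apple"], by_word["apple"], by_word["APPLE"])  # 2 2 1
print(by_lc["apple"])  # 5
```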

Typing words

When lowercase is selected, the input form will automatically adjust the input to match the setting. All of these options:

WIFI
WiFi
Wifi
wifi

will be lowercased first and will produce the same result as typing wifi.

Tools in detail

Certain tools and operations have a predefined attribute to work with and the user cannot change it. This is how individual tools behave with regard to case sensitivity:

Word sketch

The word sketch always uses a predefined attribute, typically the lempos. The attribute is defined in the sketch grammar. The user cannot change the attribute on the fly. Word sketches are precalculated during compilation and changing the attribute would require recalculation. For user corpora, the user can write their own sketch grammar that uses a different attribute.

With lempos, apple will produce different collocations from Apple. Combined collocations for Apple and apple cannot be displayed.

With lempos_lc (only possible if the user writes their own sketch grammar based on this attribute), apple produces combined collocations for both apple and Apple. Typing Apple will not produce any results because lempos_lc does not contain any lemmas starting with an uppercase letter.
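
The lookup behaviour described above can be modelled with two dictionaries. The collocates below are made up purely for illustration, and merge_by_lowercase is a hypothetical helper; real word sketches are precalculated by Sketch Engine from the sketch grammar.

```python
# Invented collocation data keyed by lempos (case sensitive).
collocations_by_lempos = {
    "apple-n": ["juice", "tree"],
    "Apple-n": ["iPhone", "store"],
}


def merge_by_lowercase(index):
    """Build a lempos_lc-style index by merging entries that differ only in case."""
    merged = {}
    for key, collocates in index.items():
        merged.setdefault(key.lower(), []).extend(collocates)
    return merged


collocations_by_lempos_lc = merge_by_lowercase(collocations_by_lempos)

# With lempos, Apple and apple have separate collocations.
print(collocations_by_lempos.get("Apple-n", []))
# With lempos_lc, apple returns the combined collocations.
print(collocations_by_lempos_lc.get("apple-n", []))
# With lempos_lc, Apple finds nothing: no key contains an uppercase letter.
print(collocations_by_lempos_lc.get("Apple-n", []))
```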

Word sketch difference

The information for the word sketch above applies to word sketch difference too.

Thesaurus

The thesaurus is based on comparing word sketches and therefore always uses the same attribute as the word sketch. To change the attribute for the thesaurus, the attribute for word sketch should be changed in the sketch grammar.

With the attribute set to lempos, apple and Apple will produce different lists of synonyms.

Concordance and Parallel concordance

Simple search searches simultaneously in several attributes, typically word, lowercase and lemma. For user corpora, this can be set in the corpus configuration file.

Other search types use the case sensitivity switch to activate the use of lowercased attributes.

CQL search – the attribute is set individually for each token.
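
Because CQL names the attribute inside each token's square brackets, case sensitivity can be mixed within one query. The Python helper below only assembles query strings; cql_token is a hypothetical name, and which attributes (lc, lemma_lc and so on) are available depends on the corpus.

```python
def cql_token(attribute, value):
    """Format one CQL token condition such as [lc="wifi"]."""
    return f'[{attribute}="{value}"]'


# Case sensitive: matches only the exact form WiFi.
print(cql_token("word", "WiFi"))  # [word="WiFi"]

# Case insensitive: matches wifi, WIFI, WiFi and Wifi.
print(cql_token("lc", "wifi"))  # [lc="wifi"]

# Mixed: any case variant of cook followed by Islands spelled exactly so.
query = " ".join([cql_token("lc", "cook"), cql_token("word", "Islands")])
print(query)  # [lc="cook"] [word="Islands"]
```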

Concordance result screen

The tools for working with the concordance result, located in the toolbar above the concordance lines, contain the BASIC and ADVANCED tabs. The latter contains either the case sensitivity switch or the attribute selector for switching between case sensitive and case insensitive mode.

Wordlist

On the ADVANCED tab, use the case sensitivity switch or select the required attribute. Additional options such as starting with, containing or ending with must match the selected attribute. This is how the combinations of settings affect the result:

input (starting with / containing / ending with / from this list) | attribute | case sensitivity switch | result | note
apple | word | inactive | apple | will not find Apple
Apple | word | inactive | Apple | will not find apple
apple | word | active | apple | results include both Apple and apple but are displayed as apple
Apple | word | active | apple | as above; the interface will lowercase the input

Use Display as to display a different result. For example, searching for apple with the attribute word and the case sensitivity switch active normally counts Apple and apple together and displays them as apple. Set Display as to word to display separate results for Apple and for apple.
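
The idea of matching on one attribute while displaying another can be sketched as follows. The wordlist function and its display_as parameter are assumptions that mimic the Display as setting; they are not Sketch Engine's API, and the tokens are invented.

```python
from collections import Counter

# Invented tokens with their word and lowercased (lc) attributes.
tokens = [
    {"word": w, "lc": w.lower()}
    for w in ["Apple", "apple", "apple", "APPLE", "pear"]
]


def wordlist(tokens, query, display_as="lc"):
    """Match case insensitively on lc, then count by the display attribute."""
    matches = [t for t in tokens if t["lc"] == query.lower()]
    return dict(Counter(t[display_as] for t in matches))


# Default: all variants are counted together and shown as apple.
print(wordlist(tokens, "Apple"))  # {'apple': 4}

# Display as word: the same matches, listed separately by original form.
print(wordlist(tokens, "Apple", display_as="word"))  # {'Apple': 1, 'apple': 2, 'APPLE': 1}
```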

N-grams

On the ADVANCED tab, use the case sensitivity switch or select the required attribute.

Keywords and terms

Keywords
By default, the word attribute is used. It can be changed on the ADVANCED tab.

Terms
The attribute is defined in the term grammar. It is usually the lemma and cannot be changed by the user.

Trends

The attribute can be selected on both the BASIC and ADVANCED tabs.

See also

Words, tags, lemmas, lemposes, lowercase – explanation of all attributes in the corpus

POS tags – explanation of part-of-speech tags
