Regular expressions

Regular expressions are a convention of using special characters to stand in for unspecified letters or numbers. They are used to set criteria for strings of characters, e.g. words or tags, which share a common pattern, e.g. start the same way, end the same way or contain certain characters.

Regular expressions are used mainly inside CQL, in word lists and n-grams.

This page only gives a few basic examples. For more detail, please refer to Wikipedia, try our regular expressions exercises or this interactive course.

Wild cards

Wild cards are not regular expressions, but users know them from other software. They are only supported in the simple concordance search.

Using wild cards in simple concordance search

In simple concordance search only, the asterisk (*), question mark (?), double dashes (--) and vertical bar (|) can be used like this:

asterisk (*) stands for zero or more characters
test* will find
test, tests, tested, testing

c*t will find
CT, cut, cat, craft, construct

question mark (?) stands for exactly 1 character
test? will find
tests, Testa, testy
but will not find
test

c?t will find these lemmas
cat, cut
Note, however, that simple search always treats each search word as a lemma, so c?t will search for the lemmas cut, cat and cot. These lemmas will produce results which include all their word forms. The final concordance will thus show: cut, cutting, cat, cats, cot, cots, etc.

To search for the asterisk and question mark, use backslash (\) such as \* and \?

double dashes (--) stand for a dash, a space or no character at all
multi--million will find
multi-million, multi million, multimillion

vertical bar (|) stands for OR

cat|dog|horse will find

cat, dog, horse
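
The wildcard rules above can be sketched as a translation into an ordinary regular expression. This is only an illustration in Python: the function names `wildcard_to_regex` and `wildcard_match` are made up here, case is ignored because the examples above show c*t finding CT, and the lemma behaviour of simple search is not modelled.

```python
import re

def wildcard_to_regex(pattern: str) -> str:
    """Translate a simple-search wildcard pattern into a regular expression."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            out.append(re.escape(pattern[i + 1]))  # \* or \? means the literal character
            i += 2
        elif pattern.startswith("--", i):
            out.append("[- ]?")                    # dash, space or nothing
            i += 2
        elif ch == "*":
            out.append(".*")                       # zero or more characters
            i += 1
        elif ch == "?":
            out.append(".")                        # exactly one character
            i += 1
        elif ch == "|":
            out.append("|")                        # OR
            i += 1
        else:
            out.append(re.escape(ch))
            i += 1
    return "".join(out)

def wildcard_match(pattern: str, text: str) -> bool:
    # Assumption: matching ignores case, as in the c*t / CT example.
    return re.fullmatch(wildcard_to_regex(pattern), text, re.IGNORECASE) is not None

print(wildcard_match("test?", "tests"))                # True
print(wildcard_match("test?", "test"))                 # False
print(wildcard_match("multi--million", "multimillion"))  # True
```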

Regular expressions (not wildcards!) are used in all the other concordance searches, in CQL to specify patterns for values and with wordlists to only include/exclude certain types of items.

Regular expressions and CQL

Regular expressions are used in CQL to specify patterns for values.

[word = "dis.*"] [tag = "V.*"] finds words beginning with dis- followed by a verb

[tag="J.*"] [word="[[:upper:]]*"] finds adjectives followed by an acronym (= word in capitals)

To copy & paste, use these:

[word = "dis.*"] [tag = "V.*"]
[tag="J.*"] [word="[[:upper:]]*"]
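
To show how such a query applies one regular expression to each token attribute, here is a small Python sketch of the first query. The token list and its tags are invented for illustration, `cql_bigram` is a hypothetical helper, and `re.fullmatch` is used on the assumption that a pattern has to match the whole attribute value.

```python
import re

# Hypothetical tagged tokens (word, tag); the tags are made up for this example.
tokens = [("They", "PP"), ("dislike", "VV"), ("disused", "JJ"), ("NATO", "NP"),
          ("bases", "NNS"), ("dis", "NN"), ("agreed", "VVD")]

def cql_bigram(tokens, word_re, tag_re):
    """Return word pairs where the first word matches word_re in full
    and the tag of the following token matches tag_re in full."""
    hits = []
    for (w1, _), (w2, t2) in zip(tokens, tokens[1:]):
        if re.fullmatch(word_re, w1) and re.fullmatch(tag_re, t2):
            hits.append((w1, w2))
    return hits

# Rough equivalent of [word = "dis.*"] [tag = "V.*"]
hits = cql_bigram(tokens, r"dis.*", r"V.*")
print(hits)  # [('dis', 'agreed')]
```

Note that "dislike" is not reported: it begins with dis-, but the token after it is tagged JJ, not V.*.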

Spaces in CQL and regular expressions

Spaces are used in CQL to make the code easier to read for the human eye. The use of spaces in CQL does not have any effect on the result.

In regular expressions, a space refers to a real space, e.g. space between two words. Since CQL criteria are set for individual tokens separately, the use of a space is generally a mistake and will not produce the required result.

CQL tutorial – introduction to corpus query language

Regex exercise

Learn regular expressions with our regex online tutorial.

Interactive lessons

dot ‘ . ‘

A dot stands for a single unspecified character.

regular expression	matching result(s)
w.n	win won wen wun wan
ca.	cat car cap cab can

question mark ‘ ? ‘

A question mark stands for zero or 1 occurrence of the preceding character

regular expression	matching result(s)
be?t	bt bet (but will not find beet beeet beeeet)
bet?	be bet (but will not find bets betting)
.?at	at hat bat cat mat (zero or one unspecified character at the beginning)

asterisk ‘ * ‘

An asterisk stands for zero or more occurrences of the preceding character.

regular expression	matching result(s)
co*l	cl col cool coool cooool
hallo*	hall hallo halloo hallooo halloooo
c.*ing	words starting with c- and ending with -ing (i.e. having any number of unspecified characters between c and ing) cycling camping cutting cooking contemplating
*ool	produces error, no character precedes the asterisk
c.*	word beginning with the letter c (c is followed by any number of any character)
.*ed	word ending with -ed (the word starts with any number of any character)
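
The dot, question mark and asterisk can be tried out with Python's `re` module. This is a sketch only: the word list is invented, `matching` is a hypothetical helper, and `re.fullmatch` is used on the assumption that the pattern must cover the whole word.

```python
import re

def matching(pattern, words):
    """Return the words that the pattern matches in full."""
    return [w for w in words if re.fullmatch(pattern, w)]

words = ["bt", "bet", "beet", "cl", "col", "cool", "cat", "cooking", "talked"]

print(matching(r"c.t", words))     # ['cat']
print(matching(r"be?t", words))    # ['bt', 'bet']
print(matching(r"co*l", words))    # ['cl', 'col', 'cool']
print(matching(r"c.*ing", words))  # ['cooking']
print(matching(r".*ed", words))    # ['talked']

# An asterisk with nothing before it is rejected as an invalid pattern.
try:
    re.compile(r"*ool")
    leading_star_ok = True
except re.error:
    leading_star_ok = False
print(leading_star_ok)             # False
```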

range ‘ [ ] ‘

use square brackets to specify a list or range
[bmpg] stands for b OR m OR p OR g
[a-d] stands for a letter between a and d
[3-5] stands for a digit between 3 and 5

regular expression	matching result(s)
[mpgb]et	met pet get bet
m[2-5]	m2 m3 m4 m5
m[2-5]*	m m22 m52 m3425 m23453234 m222345 (m followed by zero or more digits between 2 and 5)

not ‘ ^ ‘

use ^ to indicate that the character(s) should not be included. The characters have to be enclosed in square brackets and ^ must come first inside the brackets

regular expression	matching result(s)
[^m]et	pet get bet let (but will not find met)
[^mpg]et	set let (but will not find met pet get)
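
Lists, ranges and negated lists behave the same way in Python's `re` module, so the tables above can be checked with a short sketch. The word list and the `matching` helper are assumptions made for this example.

```python
import re

def matching(pattern, words):
    """Return the words that the pattern matches in full."""
    return [w for w in words if re.fullmatch(pattern, w)]

words = ["met", "pet", "get", "bet", "set", "let", "m", "m2", "m6", "m3425"]

print(matching(r"[mpgb]et", words))  # ['met', 'pet', 'get', 'bet']
print(matching(r"m[2-5]", words))    # ['m2']
print(matching(r"m[2-5]*", words))   # ['m', 'm2', 'm3425']
print(matching(r"[^m]et", words))    # ['pet', 'get', 'bet', 'set', 'let']
print(matching(r"[^mpg]et", words))  # ['bet', 'set', 'let']
```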

letters and digits

letters can be specified by a range or by character class

regular expression	matching result(s)
[A-Z]	finds any upper-case character (of the English alphabet, not characters such as é í č ß etc.)
[a-z]	finds any lowercase character (of the English alphabet)
[A-Za-z]*	finds any word consisting of upper-case and lowercase characters (of the English alphabet)
[[:alpha:]]*	finds a word consisting of letters of any alphabet including accented characters and special characters, see character classes further below

\d stands for a digit, i.e. characters 0-9, \D stands for any non-digit character

regular expression	matching result(s)
b\d	b1 b2 b3 b4
b\d*	b b1 b12 b89 b43958 (zero or more digits after b)
\d\db	58b 46b 89b (b preceded by two digits)
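
A quick way to check the digit shortcuts is the Python sketch below. The word list is invented, and note as a caveat that Python's `\d` also accepts digits from other scripts, which may differ from the engine described here.

```python
import re

def matching(pattern, words):
    """Return the words that the pattern matches in full."""
    return [w for w in words if re.fullmatch(pattern, w)]

words = ["b", "b1", "b12", "bx", "5b", "58b"]

print(matching(r"b\d", words))     # ['b1']
print(matching(r"b\d*", words))    # ['b', 'b1', 'b12']
print(matching(r"\d\db", words))   # ['58b']
print(matching(r"b\D", words))     # ['bx']
```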

character classes

Character classes are special codes used to refer to a group of characters.

character class	meaning
[[:alpha:]]	any letter including accented and special characters; for English only, equivalent to [A-Za-z]
[[:digit:]]	any digit, equivalent to [0-9] or \d
[[:alnum:]]	any alphanumeric character; for English only, equivalent to [0-9A-Za-z]
[[:lower:]]	all lower case characters [a-z]
[[:upper:]]	all upper case characters
[[:punct:]]	punctuation ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
[[:space:]]	whitespace character (space, new line, tab, carriage return)

Example:

[[:alpha:]]* finds all words composed of letters
[[:alpha:]][[:alnum:]]* finds all words starting with a letter and then composed of letters and digits, e.g. H2SO4 but not 4you
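
Python's `re` module does not understand the [[:class:]] notation, so the sketch below first rewrites a few classes into approximate Python equivalents. The mapping table and the `translate` helper are assumptions for illustration, not an exact reproduction of how the classes are defined here.

```python
import re

# Approximate Python equivalents of some character classes (an assumption).
CLASS_TO_PY = {
    "[[:alpha:]]": r"[^\W\d_]",  # any letter, including accented ones
    "[[:digit:]]": r"\d",        # any digit
    "[[:alnum:]]": r"[^\W_]",    # letters and digits
    "[[:space:]]": r"\s",        # whitespace
}

def translate(pattern):
    """Replace each known character class with its Python approximation."""
    for cls, py in CLASS_TO_PY.items():
        pattern = pattern.replace(cls, py)
    return pattern

def matches(pattern, word):
    return re.fullmatch(translate(pattern), word) is not None

print(matches("[[:alpha:]][[:alnum:]]*", "H2SO4"))  # True
print(matches("[[:alpha:]][[:alnum:]]*", "4you"))   # False
print(matches("[[:alpha:]]*", "café"))              # True
```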

or ‘ | ‘

the pipe | is used to indicate OR

regular expression	matching result(s)
get|met	will find lines which contain the word get OR the word met
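
A minimal Python check of the OR operator, using an invented word list and assuming the pattern must match the whole word:

```python
import re

words = ["get", "met", "let", "cat", "dog", "horse"]

either = [w for w in words if re.fullmatch(r"get|met", w)]
animals = [w for w in words if re.fullmatch(r"cat|dog|horse", w)]

print(either)   # ['get', 'met']
print(animals)  # ['cat', 'dog', 'horse']
```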

plus ‘+’

the plus stands for ‘one or more repetitions of the preceding character’

regular expression	matching result(s)
hallo+	hallo halloo hallooo hallooooooooo (but not hall)
.+at	bat, great, format, cat (but not ‘at’, to include ‘at’, use .*at)
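
The difference between + and * is easiest to see side by side. The Python sketch below uses an invented word list and whole-word matching:

```python
import re

def matching(pattern, words):
    """Return the words that the pattern matches in full."""
    return [w for w in words if re.fullmatch(pattern, w)]

words = ["hall", "hallo", "halloo", "at", "bat", "great"]

print(matching(r"hallo+", words))  # ['hallo', 'halloo']
print(matching(r".+at", words))    # ['bat', 'great']
print(matching(r".*at", words))    # ['at', 'bat', 'great']
```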

case sensitivity switch (?i)

Regular expressions are case sensitive by default, i.e. Bill is different from bill. To make the whole regular expression case insensitive, put these four characters at the beginning: (?i)

regular expression	matching result(s)
(?i)monday	Monday monday MONDAY
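
Python's `re` module accepts the same (?i) prefix, so the effect can be checked with a short sketch on an invented word list:

```python
import re

words = ["Monday", "monday", "MONDAY", "Mondays"]

insensitive = [w for w in words if re.fullmatch(r"(?i)monday", w)]
sensitive = [w for w in words if re.fullmatch(r"monday", w)]

print(insensitive)  # ['Monday', 'monday', 'MONDAY']
print(sensitive)    # ['monday']
```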

repetition { }

use curly brackets to indicate repetition of the preceding character

regular expression	matching result(s)
halo{3}	halooo (exactly 3 repetitions of the letter o)
hallo{2,4}	halloo hallooo halloooo (from 2 to 4 repetitions of the letter o)
.{6}	anyone player bottle (words consisting of any 6 characters, equivalent to typing 6 dots ......)
[a-z]{4,}	bake mother corporation (words consisting of 4 or more letters)
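
The repetition counts can be verified with a small Python sketch. The word lists are invented and the pattern is assumed to match the whole word:

```python
import re

def matching(pattern, words):
    """Return the words that the pattern matches in full."""
    return [w for w in words if re.fullmatch(pattern, w)]

print(matching(r"halo{3}", ["haloo", "halooo", "haloooo"]))
# ['halooo']
print(matching(r"hallo{2,4}", ["hallo", "halloo", "hallooo", "halloooo", "hallooooo"]))
# ['halloo', 'hallooo', 'halloooo']
print(matching(r".{6}", ["anyone", "bottle", "playmat", "cat"]))
# ['anyone', 'bottle']
print(matching(r"[a-z]{4,}", ["bake", "mother", "cat", "Bake"]))
# ['bake', 'mother']
```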

grouping ( )

any part of a regular expression can be surrounded by parentheses to make it a single unit onto which other regular expressions can be applied

regular expression	matching result(s)
(dis)?connect	connect disconnect (question mark makes the preceding element ‘(dis)’ optional)
(bla){3,4}	blablabla blablablabla
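
Grouping lets ? and { } apply to a whole sequence instead of one character. A Python sketch with invented word lists:

```python
import re

def matching(pattern, words):
    """Return the words that the pattern matches in full."""
    return [w for w in words if re.fullmatch(pattern, w)]

optional_prefix = matching(r"(dis)?connect", ["connect", "disconnect", "reconnect"])
repeated_group = matching(r"(bla){3,4}", ["blabla", "blablabla", "blablablabla", "blablablablabla"])

print(optional_prefix)  # ['connect', 'disconnect']
print(repeated_group)   # ['blablabla', 'blablablabla']
```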

escaping

Characters such as . ? * have a special function in regular expressions. To cancel this special function and search for the actual characters, place a backslash before them. This is called escaping (e.g. you have to escape a question mark). The characters $ and # in part-of-speech tags also have to be escaped. In CQL, double quotes " must be escaped: [word="\""]

All of these characters need to be escaped if you want to search for the character itself:

. ^ $ * + ? ( ) [ ] { } | \

So to find a dot, the dot has to be escaped with a backslash: \. To find a backslash, the backslash has to be escaped with a backslash: \\

regular expression	matching result
.	a b c d e f g h etc. (any single character)
ok?	o ok (question mark makes the preceding character optional)
ok\?	ok?
\	produces error, backslash escapes the following character, it does not have any function on its own
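
The effect of escaping can be tried in Python, where the backslash works the same way inside a pattern. This is a sketch only: `re.escape` is a Python convenience for escaping a whole string and is not something offered by CQL.

```python
import re

# Unescaped: the question mark makes the k optional.
unescaped = [w for w in ["o", "ok", "ok?"] if re.fullmatch(r"ok?", w)]
# Escaped: the question mark is an ordinary character.
escaped = [w for w in ["o", "ok", "ok?"] if re.fullmatch(r"ok\?", w)]

print(unescaped)  # ['o', 'ok']
print(escaped)    # ['ok?']

# re.escape adds the backslashes automatically.
literal = re.escape("3.5$")
print(re.fullmatch(literal, "3.5$") is not None)  # True
print(re.fullmatch(literal, "395$") is not None)  # False

# A lone backslash is not a valid pattern.
try:
    re.compile("\\")
    lone_backslash_ok = True
except re.error:
    lone_backslash_ok = False
print(lone_backslash_ok)  # False
```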

not starting with ‘ ?! ‘

Use ?! to say “not starting with”, also called negative lookahead. The brackets are required. The brackets have to be followed by a regular expression defining what the token should consist of. Use .* for any token. Use ... for 3-letter tokens. Use [[:upper:]]* for tokens consisting of uppercase characters, etc.

regular expression	matching result(s)
(?!NP).*	all POS tags not starting with NP
(?!th)...	all 3-character words not starting with “th”
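
Negative lookahead is also available in Python's `re` module, so both table rows can be checked with a sketch. The tag and word lists are invented for the example.

```python
import re

def matching(pattern, items):
    """Return the items that the pattern matches in full."""
    return [x for x in items if re.fullmatch(pattern, x)]

tags = ["NP", "NPS", "NN", "VV", "JJ"]
words = ["the", "thy", "cat", "dog", "them"]

print(matching(r"(?!NP).*", tags))    # ['NN', 'VV', 'JJ']
print(matching(r"(?!th)...", words))  # ['cat', 'dog']
```

In the second query "them" is excluded twice over: it starts with th and it has four characters, while the three dots require exactly three.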

backreferences

Since manatee 2.65, it is possible to place brackets around one or several parts of a regular expression and refer to those parts later. The first part in brackets is referred to with number 1, the second with number 2, etc. This only works within one token, e.g. [word="(ba).*\1.*"] to find baseball, basketball, etc. The N-grams tool also supports backreferences across different tokens, e.g. (.*) or \1 to find occurrences such as may or may, do or do, etc.

regular expression	matching result(s)
(abra)kad\1 (the number must be preceded by a backslash)	abrakadabra
(a)(b)(c)\3\2\1	abccba
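
Backreferences can be tried out in Python, which uses the same \1, \2 notation. In this sketch the cross-token case is only imitated by matching against a space-joined string; how the N-grams tool actually handles tokens is not modelled.

```python
import re

def full(pattern, text):
    """True if the pattern matches the whole text."""
    return re.fullmatch(pattern, text) is not None

print(full(r"(abra)kad\1", "abrakadabra"))  # True
print(full(r"(abra)kad\1", "abrakadabri"))  # False

# \3\2\1 repeats the three groups in reverse order: c, b, a.
print(full(r"(a)(b)(c)\3\2\1", "abccba"))   # True
print(full(r"(a)(b)(c)\3\2\1", "abccab"))   # False

# Imitation of the n-gram example "(.*) or \1" on a joined string.
print(full(r"(.*) or \1", "may or may"))    # True
print(full(r"(.*) or \1", "may or do"))     # False
```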
