Learn to generate word lists in 3 minutes!

Word list is a generic name for all kinds of frequency lists that Sketch Engine can generate.

Word list examples

How to generate a word list

The left menu contains two pre-selected options:

All words
generates a frequency list of word forms in the corpus

All lemmas
generates a frequency list of lemmas in the corpus
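The difference between the two options can be illustrated outside Sketch Engine with a minimal Python sketch. The toy tokens and their lemmas are invented for illustration; Sketch Engine takes them from the corpus annotation.

```python
from collections import Counter

# Hypothetical toy corpus: (word form, lemma) pairs, invented for illustration.
tokens = [
    ("She", "she"), ("goes", "go"), ("home", "home"),
    ("He", "he"), ("went", "go"), ("home", "home"),
    ("They", "they"), ("go", "go"),
]

# "All words": count each distinct word form.
word_freq = Counter(word for word, _ in tokens)

# "All lemmas": count each distinct lemma, so goes/went/go are merged.
lemma_freq = Counter(lemma for _, lemma in tokens)

print(word_freq.most_common(3))
print(lemma_freq.most_common(3))
```

In the lemma list, goes, went and go are all counted under go, whereas the word form list keeps them apart.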


  • use the following settings to define what should be included in your word list

word list settings

(1) the word list can be generated from the whole corpus or from a subcorpus only. Select the subcorpus here; you can also view information about the subcorpus or create a new one from text types

(2) select what you want to count, whether word forms, lemmas or other attributes. The list of options depends on how the corpus is annotated but will generally include these options:
attributes: word form, tag, lempos, lempos-lc, lemma, word form (lowercase), lemma-lc
word sketch: collocations, terms – same as in the term extraction feature with the only difference that single and multiword terms are generated in one table
text types: text types depend on the corpus selected and will be different for each corpus

(3) tick this option to calculate frequencies of n-grams

(4) when ticked, "at the end" will be grouped under "at the end of" because the 3-gram "at the end" is a sub n-gram of the 4-gram "at the end of"
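As a rough illustration of options (3) and (4), the following Python sketch counts n-grams in a toy sentence and checks whether one n-gram is contained in a longer one. The helper names `ngrams` and `is_sub_ngram` are hypothetical, and the sketch only shows the relationship between the n-grams, not how Sketch Engine groups them internally.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def is_sub_ngram(shorter, longer):
    """True if `shorter` occurs as a contiguous run inside the longer n-gram."""
    k = len(shorter)
    return k < len(longer) and any(
        longer[i:i + k] == shorter for i in range(len(longer) - k + 1)
    )

# Toy text, invented for illustration.
text = "at the end of the day we met at the end of the road".split()

# Frequencies of all 3-grams and 4-grams in the toy text.
freq = Counter(ngrams(text, 3) + ngrams(text, 4))

three = ("at", "the", "end")
four = ("at", "the", "end", "of")
print(freq[three], freq[four], is_sub_ngram(three, four))
```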

Filter options

Exclude the items you are not interested in using the following filters:

(5) use regular expressions to limit the results to a certain pattern
simple example: ca.* produces a frequency list of words starting with ca
please refer to the examples further below
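The effect of such a filter can be tried out in Python. This is a sketch under the assumption that the pattern has to match the whole item (which the examples on this page suggest) and that matching is case sensitive; the word list is invented.

```python
import re
from collections import Counter

# Invented sample of word forms.
words = ["cat", "car", "scar", "cat", "dog", "ca", "Cat"]

pattern = re.compile(r"ca.*")

# fullmatch mimics a filter that must match the entire item (an assumption).
filtered = Counter(w for w in words if pattern.fullmatch(w))
print(filtered)
```

Here scar is excluded because it does not start with ca, and Cat is excluded because Python's matching is case sensitive by default.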

(6) use a limit to exclude low-frequency words; use zero to include all words

(7) use a limit to exclude high-frequency words

(8) if the frequency should be calculated only for a closed list of words, upload the list (whitelist) here. The file must be a plain text UTF-8 file with one word per line. The items must correspond to the selected attribute: e.g. when lemma is selected as the attribute, goes produces no result because it is not a lemma; when lempos is selected, all items must have the format of a lempos, i.e. go-v, money-n etc.

(9) use the blacklist to exclude a closed set of items from the frequency list

(10) when ticked, non-words will be included in the list (non-words are all tokens not starting with a letter, e.g. punctuation or numerals)
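The combined effect of filters (6) to (10) can be sketched in Python as follows. The function and parameter names are hypothetical, the frequencies are invented, and treating both frequency limits as inclusive is an assumption.

```python
def parse_list(text):
    """Turn the content of a one-item-per-line list file into a set."""
    return {line.strip() for line in text.splitlines() if line.strip()}

def filter_word_list(freq, min_freq=0, max_freq=None,
                     whitelist=None, blacklist=None, include_nonwords=False):
    """Apply the filters to a {item: frequency} mapping and return a new dict."""
    result = {}
    for item, count in freq.items():
        if count < min_freq:
            continue  # (6) below the minimum frequency
        if max_freq is not None and count > max_freq:
            continue  # (7) above the maximum frequency
        if whitelist is not None and item not in whitelist:
            continue  # (8) not in the closed list
        if blacklist is not None and item in blacklist:
            continue  # (9) explicitly excluded
        if not include_nonwords and not item[:1].isalpha():
            continue  # (10) non-word: does not start with a letter
        result[item] = count
    return result

# Invented frequencies for illustration.
freq = {"the": 50, "cat": 7, "sat": 3, ",": 40, "2024": 2, "mat": 1}
blacklist = parse_list("sat\n")

print(filter_word_list(freq, min_freq=2, max_freq=10, blacklist=blacklist))
print(filter_word_list(freq, min_freq=2, max_freq=10, blacklist=blacklist,
                       include_nonwords=True))
```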

Output options

Here you can specify what should be displayed on the output screen.

(11) frequency figures
hit counts – the number of occurrences will be displayed next to each item
document counts – the number of documents in the corpus where the item appeared at least once
ARF – average reduced frequency, a modified frequency that discounts occurrences concentrated in a small part of the corpus
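To give an idea of how ARF differs from a plain hit count, here is a minimal Python sketch of the commonly published definition of average reduced frequency: with corpus size N, frequency f and v = N / f, ARF is the sum of min(d, v) over the distances d between consecutive occurrences (the corpus is treated as a cycle), divided by v. The function name and the toy positions are assumptions for illustration, and the sketch is not taken from Sketch Engine's code.

```python
def average_reduced_frequency(positions, corpus_size):
    """ARF of an item occurring at the given token positions."""
    f = len(positions)
    if f == 0:
        return 0.0
    v = corpus_size / f  # average distance between occurrences
    positions = sorted(positions)
    # Distance from the last occurrence round to the first one (cyclic corpus).
    distances = [positions[0] + corpus_size - positions[-1]]
    # Distances between consecutive occurrences.
    distances += [b - a for a, b in zip(positions, positions[1:])]
    return sum(min(d, v) for d in distances) / v

# Four hits spread evenly over 100 tokens vs. four hits clustered at the start.
print(average_reduced_frequency([0, 25, 50, 75], 100))
print(average_reduced_frequency([0, 1, 2, 3], 100))
```

Both items have a hit count of 4, but the clustered item receives a much lower ARF than the evenly spread one.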

(12) + (13) output type
– will produce a frequency list of all items matching the criteria
keywords – will only include keywords in the frequency list, i.e. specialized terminology related to the topic of the corpus. A reference corpus (14) has to be selected (leave the preselected one if not sure); the slider (15) can be used to influence to what extent more common (= less specialized) words should be included.
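One widely described way to compute a keyword score is the "simple maths" ratio of smoothed frequencies per million in the focus corpus and the reference corpus. The sketch below illustrates that idea only; the function names are hypothetical, the counts are invented, and treating the slider (15) as the smoothing constant is an assumption rather than something this page states.

```python
def per_million(count, corpus_size):
    """Relative frequency per million tokens."""
    return count * 1_000_000 / corpus_size

def keyness(focus_count, focus_size, ref_count, ref_size, smoothing=1.0):
    """Ratio of smoothed per-million frequencies (focus vs. reference)."""
    fpm_focus = per_million(focus_count, focus_size)
    fpm_ref = per_million(ref_count, ref_size)
    return (fpm_focus + smoothing) / (fpm_ref + smoothing)

size = 1_000_000  # both toy corpora have one million tokens

# A rare specialized word (5 vs. 0 hits) and a common word (2000 vs. 1000 hits).
for n in (1.0, 100.0):
    rare = keyness(5, size, 0, size, smoothing=n)
    common = keyness(2000, size, 1000, size, smoothing=n)
    print(n, round(rare, 3), round(common, 3))
```

With a small smoothing value the rare word scores higher; with a large value the common word overtakes it, which mirrors the idea of letting more common words into the keyword list.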

(16) the results can be calculated for certain attributes but different attributes can be displayed as output, e.g. frequencies can be calculated for lemmas but word forms can be displayed as output, up to 3 attributes can be displayed

Here are some examples of frequently used word list settings with regular expressions.

A list of nouns

Search attribute: lempos or lempos (lowercase)
Regular expression: .*-n
(-n might not be the noun suffix in all corpora, please refer to the Corpus details screen)

Note: the same result can be achieved by searching tags but lempos produces the results faster. To hide -n in the results, use Change output attributes: lemma or word

A list of 2- to 4-letter acronyms

The wordlist will contain all words written with 2 to 4 upper case letters.

Search attribute: word
Regular expression: [A-Z]{2,4}
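The pattern can be checked against a few invented word forms in Python, again assuming that the whole word has to match:

```python
import re

acronym = re.compile(r"[A-Z]{2,4}")

# Invented candidates for illustration.
candidates = ["EU", "NASA", "UNESCO", "Nasa", "A", "BBC"]

# Keep only words consisting entirely of 2 to 4 upper case letters.
acronyms = [w for w in candidates if acronym.fullmatch(w)]
print(acronyms)
```

UNESCO is rejected because it has six letters, Nasa because of the lower case letters, and A because it is too short. Note that [A-Z] covers only unaccented Latin capitals.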

A list of verbs and nouns beginning with re-

Search attribute: lempos
Regular expression: re.*-[vn]
(-n and -v might not be the right suffixes in all corpora, please refer to the Corpus details screen)
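Small differences in a lempos pattern change what is matched, so it can help to test variants on a few items first. The Python sketch below compares some variants on invented lempos items, assuming whole-item matching; the helper `matching` is hypothetical.

```python
import re

# Invented lempos items for illustration.
lempos_items = ["return-v", "report-n", "red-j", "re-elect-v", "care-n"]

def matching(pattern, items):
    """Items matched in full by the pattern (whole-item matching is assumed)."""
    rx = re.compile(pattern)
    return [item for item in items if rx.fullmatch(item)]

# All nouns.
print(matching(r".*-n", lempos_items))
# Verbs and nouns starting with the letters "re".
print(matching(r"re.*-[vn]", lempos_items))
# A literal hyphen after "re" keeps only hyphenated forms.
print(matching(r"re-.*-[vn]", lempos_items))
# Inside square brackets, | is an ordinary character, so [v|n] also accepts "|".
print(bool(re.fullmatch(r".*-[v|n]", "x-|")))
```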

A list of all PoS tags except proper nouns (Penn Treebank tagset with modifications)

Search attribute: tag

Regular expression: (?!NP).*

Different corpora can have different part-of-speech tagsets. Please check the PoS tagset of the corpus you are using via Corpus info.
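The negative lookahead (?!NP) rejects any tag that begins with NP and lets everything else through. A minimal Python sketch with invented tags shows the behaviour; it assumes whole-tag matching and has not been checked against Sketch Engine's own regular expression engine.

```python
import re
from collections import Counter

# Invented sequence of PoS tags for illustration.
tags = ["NN", "NP", "NPS", "VV", "NN", "JJ", "NP", "VVD"]

not_proper_noun = re.compile(r"(?!NP).*")

# Count every tag that does not start with NP.
tag_freq = Counter(t for t in tags if not_proper_noun.fullmatch(t))
print(tag_freq)
```

Note that the lookahead excludes every tag starting with NP, here both NP and NPS.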

When you use the Change output attributes option, the frequencies may not be calculated from the whole corpus. With this option selected, it is compulsory to use a regular expression. First, a concordance for the words matching the regular expression is created and the frequency is calculated only from the first 10 million hits. If the corpus is large and there are more than 10 million hits matching the regular expression, the remaining hits will be ignored and not included in the word list.

Using a regular expression such as .* to match any word works exactly the same: a concordance will be created for the first 10 million words only (because each of the first 10 million words matches the regular expression). If the corpus is bigger than 10 million words, the rest of the corpus will not be included in the frequency. The output screen notifies you about this and offers the option of using a random 10 million lines rather than the first 10 million.
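Why the choice between the first hits and a random sample matters can be shown with a scaled-down Python sketch. The limit of 20 stands in for the 10 million hits, and the function name, the seed and the toy data are assumptions for illustration.

```python
import random
from collections import Counter

def capped_counts(hits, limit, use_random=False, seed=0):
    """Count items from at most `limit` hits: the first ones or a random sample."""
    if len(hits) > limit:
        if use_random:
            hits = random.Random(seed).sample(hits, limit)
        else:
            hits = hits[:limit]
    return Counter(hits)

# Toy "corpus": the first half contains only "a", the second half only "b".
hits = ["a"] * 50 + ["b"] * 50

first = capped_counts(hits, limit=20)
sampled = capped_counts(hits, limit=20, use_random=True)
print(first)
print(sampled)
```

Taking the first 20 hits sees only "a", whereas a random sample can draw from the whole list, so it tends to reflect the corpus better when its content changes from beginning to end.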

Working with parts of speech

To restrict the word list to a specific part of speech (e.g. adjectives), there are two options:

corpus with lempos

If the corpus has a lempos attribute

  • select “lempos” as the search attribute
  • type .*-j (for adjectives, i.e. all lempos ending in -j) into the regular expression box

Lempos endings available for the corpus are listed on the corpus details screen.

corpus without lempos

If the corpus does not have the lempos attribute or you need to search for something other than the lemma

  • select “tag” as the search attribute
  • type the tag into the regular expression filter (V.* for verbs in English corpora)
  • select Change output attribute(s) and select one or more output attributes

Tags differ between languages and even between corpora in the same language. Tags available for the corpus are listed on the corpus details screen.
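The tag-based procedure above amounts to filtering on one attribute and displaying another. The following Python sketch shows the idea on invented tokens and tags, assuming whole-tag matching; it is an illustration, not a description of how Sketch Engine computes the list.

```python
import re
from collections import Counter

# Hypothetical annotated tokens: (word form, lemma, tag); all values invented.
tokens = [
    ("She", "she", "PP"), ("runs", "run", "VVZ"), ("fast", "fast", "RB"),
    ("They", "they", "PP"), ("ran", "run", "VVD"), ("and", "and", "CC"),
    ("walked", "walk", "VVD"),
]

verb_tag = re.compile(r"V.*")

# Filter on the tag attribute, but count and display the lemma instead.
verb_lemmas = Counter(
    lemma for _, lemma, tag in tokens if verb_tag.fullmatch(tag)
)
print(verb_lemmas.most_common())
```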

Word list limitations

There are limits to the length of word list that can be displayed and downloaded.