Find X (formerly called histograms) is a feature that lets you see additional information in word sketch results. This information gives more detail about how a word is used, e.g. the noun people is usually used in the plural.
Find X (word sketch highlights) can show differences in word usage such as grammatical number (singular vs plural), text type (written vs spoken), verb form (simple vs continuous vs passive), etc.
3 types of definitions
Find X can be defined in three ways.
i) a specified CQL query:
- (Q1) In this scenario, the frequency of the pattern specified by the CQL, with the word substituted at %s in Q1, is divided by the frequency of the word.
freq(Q1[word]) / freq(word)
ii) a comparison of two such CQL queries:
- (Q1 and Q2) In this scenario, the frequency of the Q1 query (with the word instantiated at %s) is divided by the sum of that same frequency and the frequency of Q2 (with the word instantiated at %s).
freq(Q1[word]) / (freq(Q1[word]) + freq(Q2[word]))
iii) a word sketch definition:
- (WS) Here the frequency of the word in the word sketch grammatical relation is divided by the frequency of the word in the entire corpus.
freq(WS[word]) / freq(word)
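The three ratios above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not Sketch Engine code: the function names, the frequency counts and the lempos value people-n are assumptions made up for the example.

```python
def instantiate(query_template: str, word: str) -> str:
    """Substitute the word at the %s placeholder of a CQL template."""
    return query_template.replace("%s", word)


def ratio_single(freq_pattern: int, freq_word: int) -> float:
    """Definitions i) and iii): freq(Q1[word]) / freq(word) or freq(WS[word]) / freq(word)."""
    return freq_pattern / freq_word if freq_word else 0.0


def ratio_pair(freq_q1: int, freq_q2: int) -> float:
    """Definition ii): freq(Q1[word]) / (freq(Q1[word]) + freq(Q2[word]))."""
    total = freq_q1 + freq_q2
    return freq_q1 / total if total else 0.0


# Hypothetical counts, invented for illustration only.
print(instantiate('[lempos=="%s" & tag="NNS_."]', "people-n"))
print(ratio_single(900, 1000))  # 0.9
print(ratio_pair(300, 100))     # 0.75
```

A ratio close to 1 means the word occurs almost exclusively in the pattern; in the two-query case, a ratio close to 0 means it prefers Q2.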
Q1 – CQL query (mandatory, unless S1 is used for parts of corpora)
Q2 – CQL query (optional)
S1 – part of corpus or subcorpus (mandatory, unless Q1 is used for queries)
S2 – part of corpus or subcorpus (optional)
HR – human-readable name of the histogram (optional)
RE – regular expression, e.g. n$ when using the lempos attribute in Q1 (optional)
TH – threshold (depending on the type of definition)
CL – coloring the information, e.g. red or blue (optional)
WS – word sketch definition name, e.g. usage patterns (mandatory if used)
How to use the Find X function?
Find X is available from the left submenu of the word list feature and is related to the use of word sketch highlights in Sketch Engine.
Additionally, a regular expression (RE) can be specified to restrict which words are considered: only the words matching the RE are processed. This is mainly for efficiency reasons.
Examples are attached. Note that you may need to alter the minimum ratio and minimum frequency to see any results.
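The effect of the RE filter, the minimum ratio and the minimum frequency can be pictured with the following sketch. It is not the actual implementation: the function name, the default thresholds, the sample counts and the choice to apply the minimum frequency to the word's overall frequency are all assumptions made for illustration.

```python
import re


def select_words(stats, pattern=None, min_ratio=0.5, min_freq=10):
    """Return (word, ratio) pairs that pass the RE, frequency and ratio filters.

    stats maps each word to (freq_pattern, freq_word).
    """
    regex = re.compile(pattern) if pattern else None
    selected = []
    for word, (freq_pattern, freq_word) in stats.items():
        # Only words matching the RE are considered at all.
        if regex and not regex.search(word):
            continue
        # Assumption: the minimum frequency applies to the word itself.
        if freq_word == 0 or freq_word < min_freq:
            continue
        ratio = freq_pattern / freq_word
        if ratio >= min_ratio:
            selected.append((word, ratio))
    # Highest ratio first, i.e. the words that are "most X".
    return sorted(selected, key=lambda item: item[1], reverse=True)


# Hypothetical counts, invented for illustration only.
stats = {
    "people-n": (950, 1000),
    "scissors-n": (40, 40),
    "dog-n": (300, 1000),
    "run-v": (5, 500),
}
print(select_words(stats, pattern=r"n$", min_ratio=0.5, min_freq=50))
```

With these invented numbers only people-n is kept. Lowering min_freq to 10 would also let scissors-n through, which is why altering the two thresholds changes whether any results appear.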
Definition file format
FindX (WS highlights) definition file format
=highlight_id
HR human readable name
Q1 query_1
Q2 query_2 # optional
RE regular_expression # optional

=highlight_id
HR human readable name
WS wsdef_relation_name
RE regular_expression # optional
# All strings in the definition files starting with # are comments and are ignored to the end of the line.
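To make the format concrete, here is a minimal reader for such a file. It is a sketch only, not the parser Sketch Engine uses. It assumes one key per line and that a definition starts at a line beginning with =; the function name and the identifier plural in the sample are hypothetical.

```python
def parse_definitions(text: str) -> dict:
    """Parse highlight definitions into {highlight_id: {key: value}}."""
    definitions = {}
    current = None
    for raw_line in text.splitlines():
        # Everything from # to the end of the line is a comment.
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        if line.startswith("="):
            # A line starting with = opens a new definition.
            current = {}
            definitions[line[1:].strip()] = current
        elif current is not None:
            # The first token is the key (HR, Q1, Q2, WS, RE, ...).
            key, _, value = line.partition(" ")
            current[key] = value.strip()
    return definitions


sample = '''
# nouns that are most often plural
=plural
HR nouns that are most often plural
Q1 [lempos=="%s" & tag="NNS_."]
RE n$  # optional
'''

print(parse_definitions(sample))
```

Because comments are cut at the first #, this sketch would truncate a query that itself contains a # character.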
Searching for passive forms using the lempos attribute
HR verbs that are most often passive
Q1 [lempos=="%s" & tag="VBB_T"]
Searching for plural forms using the lempos attribute
HR nouns that are most often plural
Q1 [lempos=="%s" & tag="NNS_."]
Searching using a threshold and colours
(50, 50, 100)
“spoken” should be replaced with the name of the subcorpus (spaces are replaced with underscores).
Adam Kilgarriff and Pavel Rychlý (2008). Finding the words which are most X. In Proceedings of the 13th EURALEX International Congress. Spain, July 2008, pp. 433–436