Writing a Sketch Grammar

A Word Sketch Grammar is a series of rules written in CQL (Corpus Query Language) that search for collocations in a text corpus and categorize them according to their grammatical relations, e.g. objects, subjects or modifiers. The result is displayed in the form of a Word Sketch in the Sketch Engine interface.

Grammatical Relation Definitions

To build word sketches, we need to specify grammatical relations. For this, we need to provide a simple grammar – a collection of definitions that allow the system to automatically identify possible relations of words to the keyword.

Example

As an example, suppose the keyword is the verb “graze” and we are considering the following instances:

  • …as sheep graze a Gloucestershire pasture…
  • I’d hoped he’d still be grazing that pasture there.
  • …took the flock to the monsoon settlement to graze the mountain pastures.

We can capture the relation of object of the verb in the following pattern (using the Modified Penn Treebank Tagset):

  1:"VB.?" [tag="DT|PRP$"]{0,1} "JJ.?"{0,3} "NN.*"{0,2} 2:"NN.*"

This pattern captures cases where the keyword (indicated by the prefix “1:”) is any verb whose tag matches VB.? (for example VB, VBZ, VBD, VBN or VBG), optionally followed by a determiner or possessive pronoun (i.e. this tag occurs either 0 or 1 times), then a string of 0-3 adjectives and 0-2 nouns, and finally a noun which is taken to be the head of the object noun phrase. The prefix “2:” marks this final noun as the word we want to capture. In all cases “pasture” will be recovered as the object of “graze”.

The attribute “tag” is taken as the default attribute and it can therefore be omitted (except in disjunctions).
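To make the matching behaviour concrete, here is a minimal Python sketch that emulates the object pattern over a hand-tagged token list. It is not how Sketch Engine evaluates CQL: the names `OBJECT_PATTERN`, `match_at` and `find_pairs`, the tuple encoding of the pattern and the tags assigned to the sample sentence are all assumptions made for illustration. The sketch also reports only the first match found greedily for each start position.

```python
import re

# Each element: (regex over the tag, min repeats, max repeats, label or None).
# Mirrors: 1:"VB.?" [tag="DT|PRP$"]{0,1} "JJ.?"{0,3} "NN.*"{0,2} 2:"NN.*"
# (in a Python regex the literal dollar sign has to be escaped).
OBJECT_PATTERN = [
    (r"VB.?", 1, 1, 1),
    (r"DT|PRP\$", 0, 1, None),
    (r"JJ.?", 0, 3, None),
    (r"NN.*", 0, 2, None),
    (r"NN.*", 1, 1, 2),
]


def match_at(tokens, pos, pattern, labels=None):
    """Match `pattern` starting at `pos`; return {label: index} or None."""
    labels = dict(labels or {})
    if not pattern:
        return labels
    regex, lo, hi, label = pattern[0]
    # Count how many consecutive tokens match this element (at most `hi`).
    count = 0
    while (count < hi and pos + count < len(tokens)
           and re.fullmatch(regex, tokens[pos + count][1])):
        count += 1
    # Try the longest repetition first, then back off one token at a time.
    for n in range(count, lo - 1, -1):
        trial = dict(labels)
        if label is not None:
            trial[label] = pos
        result = match_at(tokens, pos + n, pattern[1:], trial)
        if result is not None:
            return result
    return None


def find_pairs(tokens, pattern):
    """Return (keyword, collocate) word pairs for every matching start position."""
    pairs = []
    for start in range(len(tokens)):
        labels = match_at(tokens, start, pattern)
        if labels is not None:
            pairs.append((tokens[labels[1]][0], tokens[labels[2]][0]))
    return pairs


# Hand-tagged fragment of the third example sentence (tags are assumed).
TOKENS = [("to", "TO"), ("graze", "VB"), ("the", "DT"),
          ("mountain", "NN"), ("pastures", "NNS")]
print(find_pairs(TOKENS, OBJECT_PATTERN))  # [('graze', 'pastures')]
```

The optional elements back off until the final labelled noun can still be matched, which is why “mountain” is consumed by the 0-2 noun slot and “pastures” ends up under the label “2:”.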

We can add further definitions for the same relation to capture different constructions which realize the same underlying relation. For example, the subject of a passive verb plays the same role as the object of an active one:

  • …pastures would be grazed and never plowed.
  • When the original pasture is grazed again…

We can add the following query to the definition of the gramrel:

  2:"NN.*" [tag="RB"|tag="VM"]{0,4} [lempos="be-v"] 1:"VBN"

Here the verb in the passive construction (that is, a past participle following the verb “be”) is again marked as the keyword with the prefix “1:”. The subject is marked with the prefix “2:” and is thus taken to be the underlying object of the verb. Between the subject and the verb “be” we allow the possibility of a string of adverbs (“RB”) and/or modal verbs (“VM”).

Nota bene

Grammars that use the pattern-matching approach described here will always be less than perfect – there will be cases where they fail to capture the relation between two words, and cases where the grammar incorrectly supposes a relation exists. Such “noise” is in most cases of little importance, because word sketches only display relations that occur much more often than expected. One therefore soon reaches a point where further accuracy in the definitions no longer noticeably improves the word sketch.

Grammatical Relations File

The program that compiles the sketches (compilecorp or genws; see compiling corpora) takes a word sketch definition file as input. It is a text (ASCII) file containing queries for each grammatical relation (gramrel).

  • Comments are lines beginning with the hash character (#). Empty lines are ignored.
  • Lines beginning with the equal character (=) are gramrel names. A gramrel name can contain any character except the slash (/), which separates the two names of dual gramrels (see below). Trailing whitespace is stripped.
  • The gramrel name is followed by gramrel queries, with each query on a separate line.
  • A regular gramrel query has to contain two labelled positions with the labels “1:” and “2:”. Each query should be on one line; use a backslash (\) at the end of a line to split a query across multiple lines.
  • Lines beginning with star (*) are processing directives. They modify handling of the lines that follow them:
    • *DEFAULTATTR sets the default attribute for query evaluation. This directive is active to the end of the file or to the next *DEFAULTATTR directive.
    • *STRUCTLIMIT limits query results to a structure, for example sentence. The sequence of tokens in the result cannot cross boundaries of the structure. This directive is active to the end of the file or to the next *STRUCTLIMIT directive.
    • *FIXORDER specifies the ordering of grammatical relations for display in the interface. It is possible to specify only the first n relation names; the rest will be sorted randomly.
    • *SYMMETRIC also evaluates queries with the “1:” and “2:” labels swapped. This directive is active up to the next gramrel line (a line starting with the equal sign, =).
    • *DUAL is similar to *SYMMETRIC but it affects gramrels. It defines two gramrels from the same set of gramrel queries. Gramrel names are separated by a slash (/). All queries are evaluated for the first gramrel and then for the second gramrel with the “1:” and “2:” labels swapped.
    • *UNARY says that the following gramrel is a unary relation. Only one label is used for unary gramrel queries.
    • *SEPARATEPAGE indicates that the following *TRINARY relation should be displayed on a separate page with links from the main word sketch page. The optional parameter is the name of the aggregated gramrel; it defaults to the relation name with %s replaced by ‘*’.
    • *COLLOC specifies a created value for the collocation. It can contain ‘%’ substitution strings of the form %(n.attr), where n is the numeric label used in the query and attr is the attribute name. The created value is used for the collocation instead of the attribute given by the WSATTR option. The directive should be placed after the gramrel name and before a particular query (it can differ between queries within the same gramrel). It is active up to the next gramrel line or up to the next *COLLOC line.
    • *CONSTRUCTION indicates that the following gramrel should be displayed in the ‘Constructions’ list.
    • *TRINARY is used for trinary relations. These are translated into regular binary relations with different names. The name of a trinary gramrel should contain “%s”, and the respective queries should contain the third label “3:”. The value of the word sketch base attribute at the position labelled “3:” is then substituted for “%s” in the gramrel name. Note: starting with Manatee 2.109, you should use the same attribute format %(n.attr) as is used for the *COLLOC directive. Usually this means that instead of %s you should use %(3.lempos), provided that lempos is the name of the attribute used for word sketches (set by WSATTR).
    • *WSPOSLIST determines the list of parts of speech on the word sketch form (new in bonito 3.90); it is the successor of the WSPOSLIST option in the configuration file. It should list all the parts of speech for which the grammar yields some hits (and nothing else). The format is the same as for LPOSLIST and WSPOSLIST in the configuration file. It is relevant only if the word sketch attribute (WSATTR) is lempos.
    • *UNIMAP says which relation should be mapped to a relation in a different language; it is used in bilingual word sketches, e.g.
*DUAL
=objet/objet_de
*UNIMAP object/object_of

This means that objet should be joined with object, and objet_de should be joined with object_of (or with the gramrels paired with English object and object_of in other languages). The algorithm for finding a target language (TL) gramrel to display next to a source language gramrel X is:

• if one or more TL gramrels have a UNIMAP value matching the UNIMAP value of X, select it/them
• else if there is a TL gramrel of the same name, select that one
• else, nothing is aligned with X.

You can find more in the paper Bilingual Word Sketches: the translate Button.
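The three-step selection above can be sketched in Python as follows. This is an illustration only, not the Sketch Engine implementation: the function names and the dictionary representation of a gramrel are hypothetical, and the sketch assumes that a gramrel without an explicit *UNIMAP directive falls back to its own name as its UNIMAP value, which the text above does not state.

```python
def unimap_value(gramrel):
    # ASSUMPTION: without an explicit *UNIMAP, the gramrel name is used.
    return gramrel.get("unimap") or gramrel["name"]


def align(source, targets):
    """Return the names of target-language gramrels to show next to `source`."""
    # Step 1: target gramrels whose UNIMAP value matches that of the source.
    by_unimap = [t["name"] for t in targets
                 if unimap_value(t) == unimap_value(source)]
    if by_unimap:
        return by_unimap
    # Step 2: otherwise a target gramrel with the same name.
    # Step 3: an empty list means nothing is aligned.
    return [t["name"] for t in targets if t["name"] == source["name"]]


french_objet = {"name": "objet", "unimap": "object"}
english = [{"name": "object"}, {"name": "subject"}]
print(align(french_objet, english))  # ['object']
```

Returning a list keeps the "one or more" case of the first step and the "nothing is aligned" case of the last step in a single return type.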

Example

The example is for French. We assume that tag is the default attribute and that a lemma attribute is available. The tagset is a simple one:

  • N for nouns
  • All verbs start with V. Past participles are V:pp, infinitives are V:inf
  • ADJ for adjectives
  • ADV for adverbs
  • DET for determiners
  • PRO for pronouns
  • PRP for prepositions

French words used: et (and), ou (or), de (of), être (the verb be), avoir (the verb have)

*STRUCTLIMIT s
*DEFAULTATTR tag
*FIXORDER subject subject_of object object_of
*WSPOSLIST ",noun,-n,verb,-v,adjective,-j,adverb,-d"

*DUAL
=object/object_of
	1:"V.*" "ADJ|ADV|DET"{0,3} 2:"N"
#"The first argument is a verb, then there are between 0 and 3 adverbs, 
#adjectives and determiners, then the second argument is a noun."  In 
# this simple example, no other constructions are covered.

*DUAL
=subject/subject_of
	2:"N" "ADV|PRO"{0,2} "V.*"{0,2}  "ADV|PRO"{0,2} 1:"V.*"
	2:"N"  "ADV|PRO"{0,2} 1:"V.*"
#First clause covers cases with auxiliaries, second covers simple verbs.

=and_or
*SYMMETRIC
	1:"N" "ADJ.?"{0,3} [lemma="et|ou"] "DET.*"? "AD[JV]"{0,3} 2:"N"
	1:"V.*" [lemma="et|ou"] [lemma="de|être|avoir"|tag="ADV"]? 2:"V.*"
	1:"ADJ" "AD[JV]"{0,3} [lemma="et|ou"|word=","] "AD[JV]"{0,3} 2:"ADJ"
	1:"ADJ" "N" 2:"ADJ"
#Conjunction: one clause each for nouns and verbs (simple cases only covered),
#two clauses for adjectives to cover the case where both adjectives are next 
#to each other, and the case where one comes before the head noun and the 
#other comes after.  Note that the comma (and other punctuation) is a regular 
#token which can be searched on (counter-intuitively) as a "word".

*DUAL
=adj_subject_of/adj_subject
	1:"N" "ADV"{0,2} [lemma="être"] "ADV"{0,2} 2:"ADJ" "[^AN].*"

*DUAL
=predicate_of/predicate
	1:"N"  "ADV"{0,2} [lemma="être"] "AD[JV]|DET"{0,3} 2:"N" "[^AN].*"

*DUAL
=modifier/modifies
	2:"ADJ"  "AD[JV]"{0,3} 1:"N"
	1:"N"  "ADJ"? 2:"ADJ"
	1:"V.*" 2:"ADV"
	2:"ADV" 1:"V.*"

=infin_comp
	1:"V.*" "ADV"{0,3} 2:"V:inf"

*TRINARY
*DUAL
=pp_%(3.lemma)/pp_of_%(3.lemma)
	1:"N|ADJ|V.*" 3:"PRP" "DET|ADJ"{0,3} 2:"N"
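The effect of *DUAL and of the %(n.attr) substitution used in trinary gramrel names can be sketched in Python. This is a simplified illustration with hypothetical helper names (`swap_labels`, `expand_dual`, `fill_template`); it assumes that labels appear in a query only as a bare “1:” or “2:” prefix, and it does not reproduce how the compiler handles these directives internally.

```python
import re


def swap_labels(query):
    """Swap the "1:" and "2:" position labels in a gramrel query."""
    return re.sub(r"\b([12]):",
                  lambda m: ("2" if m.group(1) == "1" else "1") + ":",
                  query)


def expand_dual(name, queries):
    """Turn a dual gramrel 'a/b' into two gramrels, the second with swapped labels."""
    first, second = name.split("/")
    return {first: list(queries), second: [swap_labels(q) for q in queries]}


def fill_template(template, labelled):
    """Replace each %(n.attr) with attribute `attr` of the token labelled n."""
    return re.sub(r"%\((\d+)\.(\w+)\)",
                  lambda m: labelled[int(m.group(1))][m.group(2)],
                  template)


dual = expand_dual("object/object_of", ['1:"V.*" "ADJ|ADV|DET"{0,3} 2:"N"'])
print(dual["object_of"])  # ['2:"V.*" "ADJ|ADV|DET"{0,3} 1:"N"']

# Token matched at label 3 by the pp query (the value is made up for the demo).
print(fill_template("pp_%(3.lemma)", {3: {"lemma": "de"}}))  # pp_de
```

The same `fill_template` helper would also cover a *COLLOC value, since that directive uses the identical %(n.attr) format.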

Example of usage for directives *CONSTRUCTION, *SEPARATEPAGE, *UNARY and *COLLOC:

*CONSTRUCTION
*UNARY
=wh_word
	1:[] [tag="AVQ"|tag="DTQ"|tag="PNQ"]

*SEPARATEPAGE pp_X
*TRINARY
=pp_%s
	1:[tag="N.."|tag="AJ."] 3:"PR." 2:"N.."


=pp_pp
*COLLOC "%(3.word)_%(2.word)-p"
	1:[tag="N.."|tag="AJ."] 3:"PR." 2:"N.."

Macros in m4

As we continue to expand the grammar to cover more relations and more patterns for each relation, we will soon find ourselves repeating the same pattern many times. To keep the grammar simpler and easier to manage, we can write a macro in the m4 language for each recurring element. So for the example,

  1:"VB.?" [tag="DT|PRP$"]{0,1} "JJ.?"{0,3} "NN.*"{0,2} 2:"NN.*"

we could define a noun phrase macro as follows:

  define(`noun_phrase',
         `"DT|PRP$"{0,1} "JJ.?"{0,3} "NN.*"{0,2} 2:"NN.*"')

We could also abstract away from the use of the actual tag for the lexical verb in our first definition by making the definition:

  define(`lex_verb', `"VB.?"')

Writing grammars in this way also allows us to make them independent of any particular tagset. If we want to use a different tagset we simply need to redefine the basic definitions while the higher-level structures remain unchanged.

Using these two definitions we can now express our original clause for capturing the object of a verb as:

  1:lex_verb noun_phrase

In an m4 file the additional macro definitions are placed before the relation definitions and between the lines:

  divert(-1)
  ...
  divert

The program m4 (a standard Unix utility) is then run over the file to give a ‘full-form’ version which is used to build word sketches. Always use a .m4 filename extension when supplying a word sketch grammar in m4.
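To show what the ‘full-form’ output looks like for the object clause from the first example, the following Python sketch imitates the expansion of argument-less macros by plain textual substitution. It is not a replacement for m4 and ignores m4 quoting, macro arguments and divert; the `MACROS` table and the `expand` function are hypothetical names introduced for this illustration.

```python
import re

# Macro bodies corresponding to the noun phrase and lexical verb of the
# object pattern from the first example.
MACROS = {
    "noun_phrase": '"DT|PRP$"{0,1} "JJ.?"{0,3} "NN.*"{0,2} 2:"NN.*"',
    "lex_verb": '"VB.?"',
}


def expand(text, macros):
    """Replace every whole-word macro name in `text` with its body."""
    names = "|".join(re.escape(name) for name in macros)
    pattern = re.compile(r"\b(" + names + r")\b")
    # A function is used as the replacement so that backslashes or dollar
    # signs in the macro bodies are inserted literally.
    return pattern.sub(lambda m: macros[m.group(1)], text)


print(expand("1:lex_verb noun_phrase", MACROS))
# 1:"VB.?" "DT|PRP$"{0,1} "JJ.?"{0,3} "NN.*"{0,2} 2:"NN.*"
```

Switching to a different tagset would then only require changing the entries in the macro table, while the clause `1:lex_verb noun_phrase` stays the same.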