Writing term grammars is intended for advanced users.
Before you start, make sure that you know the syntax of sketch grammars and are able to write one.
This page is a short manual for creating term grammars. A term grammar tells Sketch Engine which words and phrases should be identified as terms. For example, a combination of preposition + verb + preposition is not a valid term structure in most languages, while adjective + (optional) adjective + noun is.
Generally, there is one term grammar for each language; however, additional grammars for domains requiring specific term descriptions can easily be produced.
The term grammar file is the input for the program that compiles terms (compilecorp or genws; see compiling corpora). It is a plain-text (ASCII) file containing CQL queries.
a hash # at the beginning of the line indicates a comment
a line with “=terms” must introduce term grammar relation(s)
there should be one CQL query per line; use a backslash \ at the end of a line to split a CQL query across multiple lines
each position in a term grammar must be labelled, e.g. “1:noun”
lines beginning with an asterisk * are called processing directives
*STRUCTLIMIT s restricts query results to a single structure (e.g. a sentence), so that all tokens making up a term are found inside the same sentence. This directive is active to the end of the file or to the next *STRUCTLIMIT directive.
*DEFAULTATTR tag sets the default attribute for query evaluation. This directive is active to the end of the file or to the next *DEFAULTATTR directive.
*COLLOC introduces each term grammar relation and follows the pattern *COLLOC "%(n.attr)", where n is the numeric label used in the query and attr is the attribute name, e.g. *COLLOC "%(1.gender_lemma)"
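Put together, the rules above suggest a file skeleton like the following. This is a minimal sketch rather than a complete grammar: the tag values JJ.* and NN.*, the lemma attribute used in *COLLOC and the underscore separator between the parts of the pattern are assumptions, and must be replaced by values that match your own tagset and corpus configuration.

```m4
# Minimal term grammar skeleton (illustrative; tags and attributes are assumed)
*STRUCTLIMIT s
*DEFAULTATTR tag

=terms
# adjective + noun, e.g. "natural reserve"
*COLLOC "%(2.lemma)_%(1.lemma)"
2:[tag="JJ.*"] 1:[tag="NN.*"]

# adjective + adjective + noun, with the query split by a trailing backslash
*COLLOC "%(3.lemma)_%(2.lemma)_%(1.lemma)"
3:[tag="JJ.*"] 2:[tag="JJ.*"] \
1:[tag="NN.*"]
```

Every position in both queries carries a numeric label, and each *COLLOC pattern refers only to labels that its query defines.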
Structure of term grammar
Writing a term grammar is similar to writing a sketch grammar. Generally, a term grammar consists of a heading and the term definitions themselves.
Start the definition with a heading that gives basic information such as the author, date, version and POS tagset:
# Term Definition for Russian, RFTagger MULTEXT-East tagset
# by John Smith
# version 1.0
# Tagset doc: http://example.com/tagset.html
# - 17 January 2014, John Smith
Like sketch grammars, term grammars can be written in the m4 macro language, which keeps the grammar simple and easy to manage because recurring syntax can be abbreviated. For examples with explanations, see the Macros in m4 section on the Writing a Sketch grammar page.
Always use a .m4 filename extension when supplying a term definition in m4.
The following example shows macro definitions for a term grammar. Macros are optional but recommended.
define(`noun_genitive',`[pos="N" & case="g"]')
define(`adj_genitive',`[pos="A" & case="g"]')
define(`agree',`$1.gender=$2.gender & $1.number=$2.number & $1.case=$2.case')
# macro definition of agreement in the grammatical categories of tokens (the line above)
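To see what these macros save, consider a query for a genitive adjective followed by a genitive noun that agree with each other. The fragment below is illustrative only: the labels and the pairing of the two macros are assumptions chosen for this example, not a rule taken from an existing grammar, and the *COLLOC line that a full rule needs is left out. The comment shows the CQL that results once m4 has expanded the macros defined above.

```m4
# written with macros
2:adj_genitive 1:noun_genitive & agree(1,2)

# after m4 expansion (m4 leaves text inside # comments untouched):
# 2:[pos="A" & case="g"] 1:[pos="N" & case="g"] & 1.gender=2.gender & 1.number=2.number & 1.case=2.case
```

Running the standard m4 tool over the grammar file (for example m4 grammar.m4, where the filename is hypothetical) prints the expanded text, which is a quick way to check that a macro produces the CQL you expect.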
Term grammar syntax
The example below identifies phrases such as “protected natural reserve”.
1st line: the *COLLOC directive defines the output form of the whole phrase. The phrase contains 3 words, each in a particular form (attribute): the gender-respecting lemma of the word labelled “2:”, the gender-respecting lemma of the word labelled “3:”, and the lowercased form of the word labelled “1:”. Each word is written in round brackets preceded by a percent sign “%”, and the whole phrase form is enclosed in quotation marks. Only attributes that exist in the corpus may be used.
(Do not use the ending "-x" if word sketches in the corpus are based on lemmas instead of lemposes.)
2nd line: a CQL query matching the phrases that this rule should cover. The query uses the defined macros and expands to 2:[pos="A"] 3:[pos="A"] 1:[pos="N"] & 1.gender=2.gender & 1.number=2.number & 1.case=2.case & 1.gender=3.gender & 1.number=3.number & 1.case=3.case. The labels referenced by *COLLOC are defined on this 2nd line. The label “1” marks the main word of the phrase (the headword) and is usually assigned to the most important noun in the phrase. You can check that the query is correct by running it in the concordance search of Sketch Engine.
It is good practice to add a comment with an example of the terms that the rule defines. We recommend writing comments in English, or bilingually in English and the language of the term grammar.
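The two lines described above could look roughly like the following sketch. It is a reconstruction from the description, not a copy of an official grammar: the macros adj and noun, the attribute name lc for the lowercased form and the underscore separator in the *COLLOC pattern are assumptions. The agree macro is repeated from the macro example so that the fragment is complete.

```m4
# hypothetical helper macros, introduced only for this sketch
define(`adj',`[pos="A"]')
define(`noun',`[pos="N"]')
# agreement macro as defined in the macro example
define(`agree',`$1.gender=$2.gender & $1.number=$2.number & $1.case=$2.case')

=terms
# adjective + adjective + noun, e.g. "protected natural reserve"
*COLLOC "%(2.gender_lemma)_%(3.gender_lemma)_%(1.lc)-x"
2:adj 3:adj 1:noun & agree(1,2) & agree(1,3)
```

After m4 expansion, the query line becomes the long CQL expression quoted in the description of the 2nd line. If word sketches in the corpus are based on lemmas rather than lemposes, the "-x" ending would be dropped from the pattern.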
Examples of term grammars
English term definition
This example of English term definitions can be a good starting point for writing term definitions for analytic or isolating languages.
# == Term extraction grammar for English ==
# version 2.4
# Based on WIPO from January 2013: (N|Adj)* N (of (N|Adj) N)*)
# 2015-12-02 MJ adopted for Susanne corpus
# 2013-03-27 VS created
# 2013-04-26 VS negative ending
# 2013-07-29 Revised according to final WIPO grammar (Vojta)
# 2013-08-01 added "-x" because of implicit WSSTRIP 2 (Vojta + VitS)
define(`modif',`[tag="NN.*" | tag="JJ" | tag="VVG.*"]')
3:modif 2:modif 1:noun
4:modif 3:modif 2:modif 1:noun
1:noun 2:wof 3:modif 4:noun
5:modif 4:modif 3:modif 2:modif 1:noun
2:modif 3:modif 1:noun 4:wof 5:noun
2:modif 1:noun 3:wof 4:modif 5:noun
1:noun 2:wof 3:modif 4:modif 5:noun
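The listing above shows only the modif macro and the query lines; noun and wof are used without a visible definition, and the *COLLOC lines are not shown. A complete rule might look like the sketch below, in which the definitions of noun and wof, the lc attribute and the underscore separator are all assumptions made for illustration. The "-x" ending follows the note in the changelog above.

```m4
# hypothetical definitions; the real grammar may define these differently
define(`noun',`[tag="NN.*"]')
define(`wof',`[word="of"]')
define(`modif',`[tag="NN.*" | tag="JJ" | tag="VVG.*"]')

=terms
# modifier + modifier + noun, e.g. "global positioning system"
*COLLOC "%(3.lc)_%(2.lc)_%(1.lc)-x"
3:modif 2:modif 1:noun

# noun + of + modifier + noun, e.g. "theory of general relativity"
*COLLOC "%(1.lc)_%(2.lc)_%(3.lc)_%(4.lc)-x"
1:noun 2:wof 3:modif 4:noun
```

In each rule the *COLLOC pattern lists the labels in the order in which the words appear in the phrase, while label 1 stays on the head noun.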
Slovenian term definition
This example of Slovenian term definitions is a good sample of how to define terms for inflected languages.