GDEX configuration file is used for Good Dictionary Examples system which evaluates sentences with regard to their suitability to serve as dictionary or teaching examples. In this configuration, you can define the structure of good sentences how they should look like (e.g. specific sentence length, preferring frequent words) and filter out inappropriate sentences (e.g. too long sentences, sentences with vulgarisms, etc.)
GDEX configuration introduction
GDEX configuration files are written in YAML (Wikipedia.org) which is a human-readable (and human-editable) format also suitable for effective machine processing. The actual formula for calculating sentence scores is an expression in the Python programming language. Its syntax is limited to basic mathematical and logical operations, as well as function calls to pre-defined GDEX classifiers. Several variables such as the values of positional attributes are available. Named variables defined in the configuration file, typically regular expressions, can also be referenced in the formula.
The file contains two top-level keys – the mandatory formula and optional variables. Unless the formula fits on a single line, it must be preceded by the > YAML symbol for multi-line values. YAML does not allow the tab character, so you must use spaces for indentation. This is what a simple configuration file may look like:
formula: > (50 * is_whole_sentence() * blacklist(words, illegal_chars) * blacklist(lemmas, parsnips) + 50 * optimal_interval(length, 10, 14) * greylist(words, rare_chars, 0.1) * greylist(tags, pronouns, 0.1) ) / 100 variables: illegal_chars: ([<|\]\[>/\\^@]) rare_chars: ([A-Z0-9'.,!?)(;:-]) pronouns: PRON.* parsnips: ^(tory, whisky, cowgirl, meth, commie, bacon)$
The formula is supposed to evaluate a number between 0 (worst) and 1 (best). Values outside this range will be changed to the nearest limit.
Apart from the variables (actually constants) defined in the configuration, these important ones are available:
- length – sentence length (number of tokens including punctuation)
- kw_start and kw_end – a position of the keyword (range: 0–length)
- words, tags, lemmas, lemposs, lemma_lcs – a list of values for every positional attribute (attribute name + “s”)
- illegal_chars – sentences containing one or more of these characters are penalized
- rare_chars – sentences containing these characters are penalized, but less than the characters in the illegal_chars list
- pronouns – substituent, in this case, matches all tokens with the PoS tag PRON
- parsnips – a list of words from taboo topics, sentences with these words are penalized
This a general example of how variables in the GDEX configuration can work and can be named. You can name them differently or add more variables checking further sentence structure and features.
The attribute lists can be used as a whole (for example as a parameter to a classifier) or you can even access individual tokens using standard Python syntax. For example,
words is the first word in the sentence and
tags[-1] is the tag of the last token.
blacklist(tokens, pattern) returns either 1 if none of the tokens (e.g. words, lemmas etc.) matched pattern (regular expression), 0 otherwise
greylist(tokens, pattern, penalty) is similar to blacklist, but you can specify a penalty that will be subtracted from 1 for each token matching pattern down to 0. With a penalty of 1, it behaves as a blacklist.
optimal_interval(value, low, high) returns 1 if
value is between
high. Outside this range, the score linearly rises from 0 at
low/2 to 1 at
low and falls from 1 at
high to 0 at
value lower than
low/2 or higher than
2*high, the score is 0. Usually used with
is_whole_sentence() (mind the parentheses) returns 1 if the sentence starts with a capitalized word and ends with a full stop, question mark or exclamation mark. Otherwise, it returns 0.
word_frequency(word) returns the absolute frequency of the given word in the corpus.
word_frequency(word, normalize) returns the relative frequency per
normalize tokens. For example:
keyword_position() returns a number between 0 and 1, starting at zero for a keyword at the beginning of the sentence, and rising in equal increments (depending on the length of the sentence) to 1 for a keyword at the end of the sentence.
keyword_repetition() returns the number of occurrences of the keyword in the sentence.