GDEX configuration files are written in YAML. Thanks to this format, they are human-readable (and human-editable), but also suitable for effective machine processing. The actual formula for calculating the sentence score is an expression in the Python programming language. Its syntax is limited to basic mathematical and logical operations, as well as function calls to pre-defined GDEX classifiers. Several variables, such as the values of positional attributes, are available. Named variables defined in the configuration file, typically regular expressions, can also be referenced in the formula.
The file contains two top-level keys: the mandatory formula and the optional variables. Unless the formula fits on a single line, it must be introduced by the YAML > indicator for multi-line values. YAML does not allow tab characters, so you must indent with spaces. This is what a simple configuration file may look like:
  formula: >
    (50 * is_whole_sentence()
        * blacklist(words, illegal_chars)
        * blacklist(lemmas, parsnips)
     + 50 * optimal_interval(length, 10, 14)
        * greylist(words, rare_chars, 0.1)
        * greylist(tags, pronouns, 0.1)
    ) / 100
  variables:
    illegal_chars: ([<|\]\[>/\\^@])
    rare_chars: ([A-Z0-9'.,!?)(;:-])
    pronouns: PRON.*
    parsnips: ^(tory|whisky|jesus|cowgirl|meth|commie|bacon)$
The formula is expected to evaluate to a number between 0 (worst) and 1 (best). Values outside this range are clamped to the nearest limit.
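The clamping behaviour can be sketched in plain Python (a minimal illustration of the rule above, not the engine's actual code; the function name clamp_score is made up):

```python
def clamp_score(raw):
    """Clamp a raw formula result into the valid score range [0, 1]."""
    return min(1.0, max(0.0, raw))
```

So a formula result of 1.3 is treated as 1, and a negative result is treated as 0.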
Apart from the variables (in fact constants) defined in the configuration, the following built-in variables are available:
- length — sentence length (number of tokens including punctuation)
- kw_start and kw_end — position of the keyword (range: 0–length)
- words, tags, lemmas, lemposs, lemma_lcs — a list of values for every positional attribute (attribute name + “s”)
The attribute lists can be used as a whole (for example, as a parameter to a classifier), or you can access individual tokens using standard Python indexing. For example, words[0] is the first word in the sentence and tags[-1] is the tag of the last token.
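For illustration, the attribute lists for a short sentence might look like this (the token data here is made up; the real lists are supplied by the corpus engine):

```python
# Hypothetical token table for the sentence "Cats sleep all day ."
# (illustrative data only; in GDEX these lists come from the corpus)
words  = ["Cats", "sleep", "all", "day", "."]
tags   = ["NOUN", "VERB", "DET", "NOUN", "PUNCT"]
lemmas = ["cat", "sleep", "all", "day", "."]

length = len(words)      # 5 tokens, punctuation included
first_word = words[0]    # standard Python indexing: "Cats"
last_tag = tags[-1]      # negative indices work too: "PUNCT"
```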
blacklist(tokens, pattern) returns 1 if none of the tokens (e.g. words, lemmas etc.) matches pattern (a regular expression), and 0 otherwise
greylist(tokens, pattern, penalty) is similar to blacklist, but you can specify a penalty that will be subtracted from 1 for each token matching pattern, down to a minimum of 0. With a penalty of 1, it behaves as a blacklist.
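The two classifiers can be approximated in plain Python as follows. This is a sketch, not the engine's implementation; in particular, it assumes substring matching (re.search) rather than whole-token matching, which fits the anchored and character-class patterns in the example configuration:

```python
import re

def blacklist(tokens, pattern):
    """Return 1 if no token matches the regular expression, else 0."""
    return 0 if any(re.search(pattern, t) for t in tokens) else 1

def greylist(tokens, pattern, penalty):
    """Subtract `penalty` from 1 for each matching token, floored at 0."""
    hits = sum(1 for t in tokens if re.search(pattern, t))
    return max(0.0, 1.0 - penalty * hits)
```

For example, greylist(["He", "saw", "It"], "[A-Z]", 0.1) finds two matching tokens and yields 0.8, while the same call with penalty 1 would yield 0, just like a blacklist.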
optimal_interval(value, low, high) returns 1 if value is between low and high. Outside this range, the score linearly rises from 0 at low/2 to 1 at low and falls from 1 at high to 0 at 2*high. For value lower than low/2 or higher than 2*high, the score is 0. Usually used with length.
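The piecewise-linear shape described above can be written out as a short Python sketch (a reconstruction from the description, not the engine's own code):

```python
def optimal_interval(value, low, high):
    """1 inside [low, high]; ramps linearly from 0 at low/2 up to low,
    and from high down to 0 at 2*high; 0 outside those ramps."""
    if low <= value <= high:
        return 1.0
    if low / 2 < value < low:        # rising edge
        return (value - low / 2) / (low / 2)
    if high < value < 2 * high:      # falling edge
        return (2 * high - value) / high
    return 0.0
```

With the example configuration's optimal_interval(length, 10, 14), a 12-token sentence scores 1, a 7.5-token value would score 0.5, and anything shorter than 5 or longer than 28 tokens scores 0.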
is_whole_sentence() (mind the parentheses) returns 1 if the sentence starts with a capitalized word and ends with a full stop, question mark or exclamation mark. Otherwise, it returns 0.
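A rough equivalent of this check, written as a standalone sketch: the real classifier takes no arguments and inspects the current sentence internally, so this version takes the word list explicitly for illustration:

```python
def is_whole_sentence(words):
    """1 if the sentence starts with a capitalized word and ends with
    a full stop, question mark or exclamation mark; 0 otherwise."""
    if not words:
        return 0
    starts_ok = words[0][:1].isupper()
    ends_ok = words[-1] in (".", "?", "!")
    return 1 if starts_ok and ends_ok else 0
```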
word_frequency(word) returns the absolute frequency of the given word in the corpus. word_frequency(word, normalize) returns the relative frequency per normalize tokens. Example: word_frequency(words[0], 1000)
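The relationship between the two call forms can be sketched like this; the frequency table and corpus size below are entirely made up, since the real counts come from the corpus index:

```python
# Hypothetical frequency data; real counts come from the corpus index.
CORPUS_FREQUENCIES = {"the": 61847, "parsnip": 12}
CORPUS_SIZE = 1_000_000  # total tokens in the (made-up) corpus

def word_frequency(word, normalize=None):
    """Absolute frequency, or relative frequency per `normalize` tokens."""
    freq = CORPUS_FREQUENCIES.get(word, 0)
    if normalize is None:
        return freq
    return freq * normalize / CORPUS_SIZE
```

In this toy corpus, word_frequency("parsnip") gives 12 occurrences, while word_frequency("parsnip", 1000) gives 0.012 occurrences per thousand tokens.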
keyword_position() returns a number between 0 and 1, starting at zero for a keyword at the beginning of the sentence, and rising in equal increments (depending on the length of the sentence) to 1 for a keyword at the end of the sentence.
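One plausible reading of "equal increments" is linear interpolation over token positions; the sketch below assumes that and, unlike the argument-free real classifier, takes the keyword position and sentence length explicitly:

```python
def keyword_position(kw_start, length):
    """0 for a keyword at the start of the sentence, 1 at the end,
    assuming simple linear interpolation over token positions."""
    if length <= 1:
        return 0.0
    return kw_start / (length - 1)
```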
keyword_repetition() returns the number of occurrences of the keyword in the sentence.