Sketch Engine provides several methods whose results can be controlled with many parameters. In this section, all methods are listed and parameters specific to each method are described. The “universal attributes” (that can be used with all methods) are described below. Note that some characters (e.g. space) must be escaped. For more information, see e.g. http://en.wikipedia.org/wiki/Percent-encoding.

The structure of the output of these methods can be found on the JSON API documentation page.

Universal attributes

There are a few attributes that can be used with any method.

Parameter Type Default Description
corpname string REQUIRED corpus name (in the short form, e.g. ‘bnc’) which will be processed. You can query your own corpus (e.g. username john, corpus mycorpus), just use value user/john/mycorpus
usesubcorp string empty name of a subcorpus that will be processed. Default is empty which means working with the entire corpus
format string empty the format of the output; an empty value is interpreted as JSON; methods differ in which output (export) formats they support, but most of them allow json, xml, csv, txt and xlsx. JSON is the default format of the Sketch Engine API.
json JSON all input attributes encoded as a string in JSON
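For illustration, a minimal sketch of a request that uses only the universal attributes; the USERNAME and API_KEY placeholders and the endpoint are the same as in the wsketch example further below, and the corp_info method and the bnc2 corpus serve only as an example:

import requests

BASE_URL = 'https://api.sketchengine.eu/bonito/run.cgi'
USERNAME = ''   # your username
API_KEY = ''    # get your API key at https://app.sketchengine.eu/ in My account

# universal attributes: corpname is required, usesubcorp and format are optional
response = requests.get(BASE_URL + '/corp_info', auth=(USERNAME, API_KEY), params={
    'corpname': 'preloaded/bnc2',   # or e.g. 'user/john/mycorpus' for your own corpus
    'format': 'json',               # JSON is the default anyway
})
print(response.json())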

corp_info

Provides detailed information about the corpus, lexicon sizes etc.

Parameter Type Default Description
gramrels boolean 0 get list of grammar relations from the respective word sketch grammar
corpcheck boolean 0 get output from last corpcheck (if available in compilation log)
registry boolean 0 get registry file content and settings from manatee (might differ)
subcorpora boolean 0 get list of subcorpora and their respective sizes (in tokens, words)
struct_attr_stats boolean 0 get structure attributes, and their lexicon sizes
import requests

# BASE_URL, USERNAME and API_KEY as in the wsketch example below; parameters can also be passed directly in the URL:
requests.get(BASE_URL + '/corp_info?corpname=bnc2;gramrels=1;subcorpora=1', auth=(USERNAME, API_KEY))
{
 "info": "Balanced English corpus ...",
 "encoding": "UTF-8",
 "compiled": "12/07/2016 14:31:51",
 "unicameral": false,
 "alsizes": [],
 "tagsetdoc": "https://...",
 "gramrels": [],
 "structs": [],
 "wposlist": [
 ["adjective", "AJ."],
 ...
 ],
 "lang": "English",
 "name": "British National Corpus (BNC) ...",
 "sizes": {
 "tokencount": "112181015",
 "sentcount": "6052184",
 "normsum": "96052598",
 "parcount": "1514906",
 "doccount": "4054"
 },
 "subcorpora": [],
 "infohref": "http://...",
 "lposlist": [
 ["adjective", "-j"],
 ...
 ],
 "attributes": [
 {
 "fromattr": "",
 "id_range": 0,
 "dynamic": "",
 "name": "word",
 "label": ""
 },
 ...
 ]
}

wordlist

This method provides the functionality of “word list” and “keywords” functions that are normally available under the link “Word List” in the web interface.

Parameter Type Default (example) Description
wltype string REQUIRED ‘simple’ for normal word list, ‘keywords’ for comparing with a reference corpus (see ref_corpname)
wlattr string REQUIRED (‘word’) corpus attribute you want to work with

the special value 'WSCOLLOC' returns the word sketch in the form of a list

wlnums string frq defines the type of frequency figures; possible values are frq, docf and arf (word frequency, document frequency and ARF, respectively). If more than one value is used, they must be separated by ‘,’.
wlminfreq integer 5 minimum frequency in the corpus
wlmaxitems integer 100 maximum number of items returned; the actual value can be limited depending on the corpus, but is not limited for user corpora
wlpat string .* RE pattern for positive filtering; relevant only for the simple word list
wlsort string w if ‘frq’, the resulting word list is sorted by frequency, otherwise by the attribute; relevant only for the simple word list
ref_corpname string REQUIRED corpus name (in the short form, e.g. ‘bnc2’) of the reference corpus; relevant only for the “keywords” function
ref_usesubcorp string reference subcorpus name; relevant only for the “keywords” function
wlfile file allows sending a file with an allowlist via a POST request
wlblacklist file allows sending a file with a denylist via a POST request
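A hedged sketch of a simple word list request; BASE_URL, USERNAME and API_KEY are the placeholders from the wsketch example below, and the concrete values (corpus, pattern, limits) are purely illustrative:

# simple word list: words ending in 'ing' with frequency of at least 5
r = requests.get(BASE_URL + '/wordlist', auth=(USERNAME, API_KEY), params={
    'corpname': 'preloaded/bnc2',
    'wltype': 'simple',
    'wlattr': 'word',
    'wlpat': '.*ing',
    'wlminfreq': 5,
    'wlmaxitems': 10,
    'wlsort': 'frq',
    'format': 'json',
})
print(r.json())   # the structure of the output is described on the JSON API documentation page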

attr_vals

Returns a list of values for a given structure attribute.

Parameter Type Default Description
avattr string REQUIRED structure attribute
avpat string .* RE pattern that the attribute values are matched against; the default matches all values
avmaxitems integer 0 maximum items to be returned
avfrom integer 0 start from nth item

run.cgi/attr_vals?corpname=bnc2;avpat=.*br.*;avattr=u.who

{
 "query": "br",
 "suggestions": ["PS4BR", "PS3BR", "PS2BR", "PS1BR"],
 "no_more_values": true
}

wsketch

Word sketch method for retrieving a survey of a word’s collocational behavior.

Parameter Type Default Description
lemma string REQUIRED lemma, basic wordform
lpos string auto part of speech in notation ‘-n’, ‘-v’, … but the particular notation depends on the corpus. If the corpus contains a “lempos” attribute and the lpos attribute is omitted, it is automatically replaced by the most frequent lpos for the specified lemma. Otherwise, it has no effect.
sort_gramrels boolean (integer) 1 sort grammatical relations
minfreq integer, auto auto minimum frequency of a collocate. ‘auto’ is a function of corpus size
minscore float 0.0 minimum salience of a collocate
maxitems integer 25 maximum number of items in a grammatical relation
clustercolls integer  0  cluster collocations
minsim float 0.15 minimum similarity between clustered items, relevant only when clustercolls is set to 1
expand_seppage boolean (integer) 0 expand SEPARATEPAGE grammar relations and their collocations; useful for exporting word sketch data
#!/usr/bin/env python3

import time
import requests

# get your API key here: https://app.sketchengine.eu/ in My account
USERNAME = ''
API_KEY = ''
BASE_URL = 'https://api.sketchengine.eu/bonito/run.cgi'

for item in ['make', 'ensure']:
    d = requests.get(BASE_URL + '/wsketch', auth=(USERNAME, API_KEY), params={
        'lemma': item,
        'lpos': '-v',
        'corpname': 'preloaded/bnc2',
        'format': 'json',
    }).json()
    print('Word sketch data for', item)
    for g in d['Gramrels'][:3]:
        print('    ' + g['name'])
        for i in g['Words'][:3]:
            print('        ' + i['word'])
    # beware of FUP, see https://www.sketchengine.eu/service-level-agreement/
    time.sleep(5)
Word sketch data for make
    subject
        decision
        company
        God
    object
        decision
        sense
        use
    usage patterns
        np_pp
        passive
        Sfin
Word sketch data for ensure
    subject
        arbitrage
        draftsman
        tenant
    object
        compliance
        survival
        continuity
    usage patterns
        Sfin
        np_pp
        passive

thes

Thesaurus list.

Parameter Type Default Description
lemma string REQUIRED
lpos see wsketch
maxthesitems integer 60 maximum number of items
clusteritems integer (boolean) 0 see wsketch
minsim see wsketch
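A hedged sketch of a thesaurus request; placeholders as in the wsketch example above, the lemma and lpos values are illustrative only:

r = requests.get(BASE_URL + '/thes', auth=(USERNAME, API_KEY), params={
    'corpname': 'preloaded/bnc2',
    'lemma': 'test',
    'lpos': '-n',
    'maxthesitems': 20,
    'format': 'json',
})
print(r.json())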

wsdiff

This method provides Sketch difference.

  • lemma – first lemma. This attribute is required.
  • lemma2 – second lemma. This attribute is required.
  • lpos – part of speech in notation ‘-n’, ‘-v’, … (the particular notation depends on the corpus). If the corpus contains the “lempos” attribute, it is required; otherwise it has no effect.
  • sort_gramrels – “sort grammatical relations” flag. Values ‘0’, ‘1’ (default)
  • separate_blocks – “separate blocks” flag. ‘1’ (default) = “common/exclusive blocks”, ‘0’ = “all in 1 block”
  • minfreq – minimum frequency in the corpus. Default is ‘auto’, which is a function of the corpus size. Other possible values are natural numbers.
  • maxcommon – maximum number of items in a grammatical relation of the common block (default 25)
  • maxexclusive – maximum number of items in a grammatical relation of the exclusive block
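A hedged sketch of a Sketch difference request; placeholders as in the wsketch example above, and the lemmas and the ‘-j’ lpos are illustrative only (they assume the BNC-style lpos notation shown in the corp_info output above):

r = requests.get(BASE_URL + '/wsdiff', auth=(USERNAME, API_KEY), params={
    'corpname': 'preloaded/bnc2',
    'lemma': 'clever',
    'lemma2': 'smart',
    'lpos': '-j',
    'format': 'json',
})
print(r.json())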

view

This method provides access to concordance lines and all possibilities of sorting, sample selecting and filtering of them. It operates in two modes:

  1. asynchronous (default) – the computation is started in the background and the request returns immediately, or as soon as the required number of concordance lines is available. The required number of concordance lines is the product of the fromp and pagesize attributes (see below).
  2. synchronous – the request does not return until the whole concordance is computed. To enable this, pass asyn=0 or (for bonito versions older than 5) async=0.

The basic attribute is the q attribute, which contains a list of search queries that are processed incrementally. A list of queries can be transferred through the CGI interface as ‘q=item1;q=item2…’; another possibility is to use the JSON interchange format, see the following sections. The first query specifies the basic search query, the following ones specify sorting and filtering options. The construction of a query is not trivial, so it is described here in more detail. The content of the q attribute is a string with the following structure:

〈query_sign〉〈query〉

where 〈query_sign〉 specifies the type of the query and is one character from the set {‘q’, ‘a’, ‘r’, ‘s’, ‘n’, ‘p’, ‘w’, ‘F’, ‘e’} (‘q’, ‘a’ and ‘w’ queries can be used as the basic search query, the others behave as filters or sorting operations). The rest of the query depends on the 〈query_sign〉, as follows.

Basic search queries:

  • q – is followed by a common CQL query with all its possibilities. Examples:
q[lemma="drug"]
q[lemma="drug"][lemma="test"] within <s/>
q[lemma="drug"] [lemma="test"] within <s/>

(there is no difference between the last two examples, they just demonstrate that spaces can be used within the CQL query but they are not required)

  • a – the same like q but it is possible to specify the default attribute. Syntax and example:
a〈default_attribute〉,〈CQL_query〉
---------------------------------
alemma,"drug" [tag="N.*"]
  • w – query from Word Sketch. This is used in links from word sketch tables to concordances. The ‘w’ character is followed by a number ID that specifies lines that match a particular word sketch relation. The ID can be pulled from the field ‘seek’ in the Word Sketch JSON output (see the next sections). More comma-delimited IDs can be specified; in this case, the result is union. Example:
w4816743
w,4816743,4816826

Sorting and filtering options:

  • r – selecting a random sample from the concordance. The ‘r’ character is followed by a natural number or percentage that specifies the size (number of lines) of the sample. Examples:
r250
r20%
  • s – sorting the concordance. Syntax:
s〈attribute〉/〈marks〉〈space〉〈sort_range〉
s〈attribute〉/〈marks〉〈space〉〈sort_range〉〈space〉〈attribute〉/〈marks〉〈space〉〈sort_range〉
s〈attribute〉/〈marks〉〈space〉〈sort_range〉〈space〉〈attribute〉/〈marks〉〈space〉〈sort_range〉〈space〉〈attribute〉/〈marks〉〈space〉〈sort_range〉

The first three patterns stand for the sorting options available under the “Sort” menu in the web interface. As can be seen from patterns 2 and 3, multilevel sorting options are also available.
Legend to the first three patterns:

  • 〈attribute〉 is the particular corpus attribute used. It can also be a structure attribute, e.g. ‘doc.id’ for sorting according to the document IDs.
  • 〈marks〉 can be ‘i’, ‘r’, ‘ir’ or empty (“”), which means “ignore case”, “reverse order”, both of them or none of them
  • 〈space〉 is the space character (‘ ’)
  • 〈sort_range〉 is either a position or a range.
  • Positions can be referenced as follows:
    • integer number – where 0 is the first token in KWIC, -1 the rightmost token in the left context etc.
    • 1:x – where x is one of the corpus structures (e.g. “doc” or “s” if the corpus has the particular markup). Its meaning is the first token in the structure, except when it is the right boundary of a range – then it is the last token in the structure. Also, other numbers can be used, e.g. -2:x, 3:x, etc. (-1 is the same as 1 with meaning “structure containing KWIC”)
    • a<0 – where ‘a’ stands for a position reference as described in the first two points with meaning “‘a’ positions before/after the first KWIC position” (so this is equivalent to ‘a’)
    • a>0 – where ‘a’ stands for the same position reference with meaning “positions before/after the last KWIC position”
    • in the previous two points, if ‘0’ is substituted with a natural number ‘k’, it means “before/after ‘k’-th collocation” instead of “before/after KWIC”. Collocations are special token groups in the context, that can be added using positive filters (see below)

    Ranges can be referenced as a~b where ‘a’, ‘b’ stand for token identifiers as above. Examples of positions and ranges:

    • -1<0 – rightmost token in the left context
    • 3>0 – third token in right context
    • 0>0 – last token in KWIC
    • 0<0 – first token in KWIC
    • 0<0~0>0 – range of KWIC
    • -1<0~1>0 – range of KWIC with one token from the left context and one from the right context
    • 1:s – first token in the sentence containing KWIC (or its first token)
    • 1:s>0 – first token in the sentence containing KWIC (or its last token)
    • 0<1 – first token of the first-added collocation

Examples:

sword/ 1>0~3>0
slemma/ 0<0~0>0
sword/i -1
sword/ 0 word/ir -1<0 tag/r -2<0
  • n – negative filter. Syntax:
n〈position〉〈space〉〈position〉〈space〉〈selected_token〉〈space〉〈CQL_query〉
  • where:
    • 〈position〉 stands for a position reference as explained in the “s” section
    • 〈space〉 is the space character
    • 〈selected_token〉 stands for “selected token”. Values ‘-1’ = last, ‘1’ = first
    • 〈CQL_query〉 stands for a query that – if found between the two specified positions – filters out the particular line of the concordance

Examples:

n-5 -1 -1 [lemma="drug"]
n-5 -1 -1 [lc="drug" & tag="J.*"]
  • p – positive filter; similar to the negative filter above. Syntax and example:
p<position><space><position><space><selected_token><space><CQL_query>
-----------------------
p-1 -1 -1 [word="drug"]
  • F – filtering the first occurrences of a query within a structure. Syntax and example:
F<structure>
-----------------------
Fbncdoc
  • e – sorting lines by GDEX scores

The syntax is e〈number〉, where 〈number〉 is the number of concordance lines to be sorted. The higher the number, the slower the computation, so it is advisable to use numbers well below 1,000. A sample (〈number〉 lines) of the whole concordance is scored with a default GDEX configuration and sorted by the GDEX scores. The rest of the concordance is appended to the sorted sample.

The default configuration can be changed by appending a custom GDEX configuration filename to the query (e.g. e100+myGDEXconfigFile where the plus symbol is URL-encoded white space).

If you want to get the GDEX scores in the result, use E instead of e.

Note that asynchronous query processing (turned off with asyn=0 or, for bonito versions older than 5, async=0) might impact the result.
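Before moving on to the remaining attributes, here is a hedged sketch of how a q list combining a basic CQL query with a random sample and a sort is passed through the API. Repeated q parameters correspond to the ‘q=item1;q=item2…’ notation above; placeholders are as in the wsketch example above and all concrete values are illustrative:

r = requests.get(BASE_URL + '/view', auth=(USERNAME, API_KEY), params={
    'corpname': 'preloaded/bnc2',
    'q': [
        'q[lemma="drug"][lemma="test"] within <s/>',  # basic search query
        'r250',                                       # random sample of 250 lines
        'sword/i 1>0~3>0',                            # sort by the first three tokens of the right context, ignoring case
    ],
    'asyn': 0,       # wait until the whole concordance is computed
    'format': 'json',
})
print(r.json())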

Other attributes of the “view” method
  • asyn – if set to 1, the result is processed asynchronously, which means that you obtain an initial part of the result before the complete result is computed; by repeating the same call you may receive a bigger result; once the query is fully processed, you receive finished: 1 in the result. In the majority of cases it is recommended to turn it off (asyn=0); default 1
  • pagesize – size (number of lines) of the resulting concordance. Default 20
  • fromp – number of the page that is returned. Default 1
  • kwicleftctx – size of the left context in KWIC view. Can be expressed as:
    • 〈number〉 – number of tokens (should be negative in the left context)
    • 〈number〉# – a positive number of characters followed by ‘#’ (note that the ‘#’ character must be escaped in URLs), e.g. ’40#’ (default value)
    • 〈structure_number〉:〈tag〉 – structural context, e.g. ‘-1:s’ stands for left context of the whole sentence. In the left context, 〈structure_number〉 should be negative
  • kwicrightctx – size of the right context, similar to kwicleftctx. Both 〈number〉 in the number of tokens and 〈structure_number〉 in the case of structural notation should be positive
  • viewmode – “KWIC” / “sentence” view mode. Values: ‘kwic’ (default), ‘sen’
  • attrs – comma-delimited list of attributes that are returned for KWIC tokens. The set of available attributes depends on the corpus. Examples of values: ‘word’ (default), ‘word,lemma’, ‘lemma,tag,word’ etc.
  • ctxattrs – comma-delimited list of attributes that are returned for context tokens. Examples of values: ‘word’ (default), ‘word,lemma’, ‘lemma,tag,word’ etc.
  • structs – comma-delimited list of structure tags that are returned/applied. Default: ‘p,g’
  • refs – comma-delimited list of items returned in the “references” field. Default is ‘#’ that stands for token number or value of option SHORTREF defined in the corpus configuration file. Other possible values are:
<attribute>
=<attribute>
  • where 〈attribute〉 is an attribute of one of the corpus structures, e.g. doc.id, s.n, … The first notation displays the information in name=value format, the second one returns only the value.
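Putting the display attributes together, a hedged example of a complete view call; placeholders as above, and doc.id is assumed to exist as a structure attribute of the corpus:

r = requests.get(BASE_URL + '/view', auth=(USERNAME, API_KEY), params={
    'corpname': 'preloaded/bnc2',
    'q': 'q[lemma="drug"]',
    'asyn': 0,
    'pagesize': 10,          # 10 concordance lines per page
    'fromp': 1,              # first page
    'kwicleftctx': '-5',     # 5 tokens of left context
    'kwicrightctx': '5',     # 5 tokens of right context
    'attrs': 'word,lemma,tag',
    'ctxattrs': 'word',
    'refs': '=doc.id',
    'viewmode': 'kwic',
    'format': 'json',
})
print(r.json())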

freqs

This method provides access to the frequency statistics.

Attributes:

  • q – query list, the same as for the “view” method. This attribute is required.
  • fcrit – the object of the frequency query, i.e. “the frequency of what are you looking for?” (This attribute is required.) The syntax of values of this attribute is very similar to the sorting queries of the “view” method:
<attribute>/<marks><space><sort_range>
<attribute>/<marks><space><sort_range><space><attribute>/<marks><space><sort_range>
<attribute>/<marks><space><sort_range><space><attribute>/<marks><space><sort_range><space><attribute>/<marks><space><sort_range>
  • with all parts being the same as in the sorting options, except that <marks> can only be ‘i’ or empty (“”). Examples of possible values with explanation:
    • tag 0~0>0 – frequency of tags of all KWIC tokens
    • tag 0 – frequency of tags of first KWIC tokens
    • word/ 0 lemma/i -1<0 – (multilevel) frequency of first word in KWIC and last lemma in the left context (with ignored case on)
  • fcrit can also be a list; in that case, the output contains more blocks.
  • flimit – frequency limit. Default 0
  • fmaxitems – max. number of lines on the output. Default 50
  • freq_sort – identifier of the column by which the output should be sorted (its number, counted from 0), or ‘freq’ (default), which means sorting by frequency

Note: the freqml and freqtt methods return the same output as the freqs method, using attributes and values taken from the forms provided by the freq method. For more details about how these methods work, see the note for the view method.
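A hedged sketch of a freqs request using an fcrit value from the examples above; placeholders as above and the concrete values are illustrative:

r = requests.get(BASE_URL + '/freqs', auth=(USERNAME, API_KEY), params={
    'corpname': 'preloaded/bnc2',
    'q': 'q[lemma="drug"]',
    'fcrit': 'word/ 0',      # frequency of the first word in KWIC
    'flimit': 5,
    'fmaxitems': 20,
    'freq_sort': 'freq',
    'format': 'json',
})
print(r.json())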

collx

This method provides collocation candidates computation.

Attributes:

  • q – query list, the same as for the “view” method. This attribute is required.
  • cattr – corpus attribute over which the computation is performed. Default is ‘word’
  • cfromw – search range – “from” – in token index (only integer numbers allowed). Default -5
  • ctow – search range – “to” – similar. Default 5
  • cminfreq – minimum frequency in corpus. Default 5
  • cminbgr – minimum frequency in given range. Default 3
  • cmaxitems – maximum number of displayed lines. Default 50
  • cbgrfns – list of displayed functions in the output result: cbgrfns=f1;cbgrfns=f2;… Default [‘t’, ‘m’]
  • csortfn – function according to which the result is sorted. Default ‘f’.

Notation of the functions:

  • t – T-score
  • m – MI
  • 3 – MI3
  • l – log likelihood
  • s – min. sensitivity
  • p – MI log frequency
  • r – relative frequency
  • f – absolute frequency
  • d – logDice

Note: the coll method (which returns the collocation candidates input form) is related to this method.
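A hedged sketch of a collocation candidates request; placeholders as above, and the attribute, ranges and functions are illustrative only:

r = requests.get(BASE_URL + '/collx', auth=(USERNAME, API_KEY), params={
    'corpname': 'preloaded/bnc2',
    'q': 'q[lemma="drug"]',
    'cattr': 'lemma',
    'cfromw': -5,
    'ctow': 5,
    'cminfreq': 5,
    'cminbgr': 3,
    'cmaxitems': 20,
    'cbgrfns': ['t', 'm', 'd'],   # show T-score, MI and logDice
    'csortfn': 'd',               # sort by logDice
    'format': 'json',
})
print(r.json())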

save* methods

This group of methods is now obsolete. It included: savecoll, saveconc, savefreq, savethes, savewl, savews. Use these methods instead: collx, view, freqs, thes, wordlist, wsketch, together with the format parameter (values json, xml, xls, csv, tsv, txt). Mind that not all combinations are available; e.g. the txt format will work with the view method, but not with wsketch.
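For example, exporting a concordance as plain text instead of JSON is only a matter of changing the format parameter. A hedged sketch (placeholders as above; the txt/view combination is the one mentioned above as supported):

r = requests.get(BASE_URL + '/view', auth=(USERNAME, API_KEY), params={
    'corpname': 'preloaded/bnc2',
    'q': 'q[lemma="drug"]',
    'asyn': 0,
    'format': 'txt',
})
with open('concordance.txt', 'wb') as f:
    f.write(r.content)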

subcorp

This method performs creation and deletion of subcorpora.

Attributes:

  • subcname – name of the new subcorpus (or of the subcorpus being deleted). Default None (no operation with subcorpora).
  • delete – if not empty (the default is empty), the subcorpus is deleted instead of created
  • corpus structure attributes and their values can be used here as attributes and values of the method. The selected values define the span of the subcorpus.
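A hedged sketch of subcorpus creation; doc.genre is a hypothetical structure attribute used only for illustration (use the corp_info or attr_vals methods to find out which structure attributes and values your corpus really has), and placeholders are as above:

# create a subcorpus 'my_fiction' from documents whose (hypothetical) doc.genre value is 'fiction'
r = requests.get(BASE_URL + '/subcorp', auth=(USERNAME, API_KEY), params={
    'corpname': 'preloaded/bnc2',
    'subcname': 'my_fiction',
    'doc.genre': 'fiction',
    'format': 'json',
})
print(r.json())
# deleting it again: pass the same subcname together with a non-empty delete parameter
requests.get(BASE_URL + '/subcorp', auth=(USERNAME, API_KEY),
             params={'corpname': 'preloaded/bnc2', 'subcname': 'my_fiction', 'delete': '1'})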

extract_keywords

This method allows you to extract single-word keywords as well as multi-word terms from your (user) corpora. It might take some time, especially with large reference corpora, so be patient.

For extracting multi-word units (also called terms), use the parameter "attr" with the value "TERM", i.e. "attr": "TERM".

Parameter Type Default Description
corpname string REQUIRED Corpus name of the focus corpus.
ref_corpname string REQUIRED Corpus name of the reference corpus; it must have the same processing (the same attributes, the same term grammar).
simple_n float 1.0 Simple maths parameter for the extraction.
attr string word Which attribute to use for the extraction (usually word, lemma and lc are available), or use the value TERM for term extraction.
stopwords integer 0 Whether to filter out words from stoplist (not implemented yet).
alnum integer 0 Whether all characters should be alphanumerical.
onealpha integer 1 Whether items should contain at least one alphanumerical character.
minfreq integer 5 Minimum frequency of items in the response.
max_keywords integer 100 The number of items to be returned in the response.
https://api.sketchengine.eu/bonito/run.cgi/extract_keywords?attr=word&corpname=preloaded/aclarc_1&ref_corpname=preloaded/bnc2_tt21&format=json&max_keywords=2
{
    "keywords": [
        {
            "ref_link": "view?corpname=bnc2_tt21;q=q[word=\"parser\"]",
            "frq2": 72,
            "frq1": 24011,
            "item": "parser",
            "score": "297.13",
            "link": "view?corpname=aclarc_1;q=q[word=\"parser\"]"
        },
        {
            "ref_link": "view?corpname=bnc2_tt21;q=q[word=\"NP\"]",
            "frq2": 122,
            "frq1": 29452,
            "item": "NP",
            "score": "286.59",
            "link": "view?corpname=aclarc_1;q=q[word=\"NP\"]"
        }
    ],
    "corpus": "aclarc_1",
    "ref_corpus": "bnc2_tt21"
}
# compare word (lowercase) frequencies between BAWE2 corpus and BNC2
https://api.sketchengine.eu/bonito/run.cgi/extract_keywords?corpname=preloaded/bawe2&ref_corpname=preloaded/bnc2_tt21&attr=lc&format=json
{   
    "Keywords": [
        {
            "str": "website",
            "q": "...",
            "rel_ref": 0.0,
            "score": 103.4,
            "rel": 102.4,
            "freq_ref": 0,
            "freq": 854
         },
         {
            "str": "eu",
            ...
         },
         ...
    ],
    ...
}