Documentation for expert users

This section documents how to use Sketch Engine via its API (application programming interface). In a nutshell, this means authenticating and using your own scripts to create corpora, use features (word sketch, concordance, etc.) and extract data from Sketch Engine. Using Sketch Engine this way requires programming knowledge.

If you have a problem with authentication or with using a particular feature via the API, you can contact us at support@sketchengine.eu. However, please bear in mind that we are not able to check all user scripts or write scripts for you.

Sketch Engine JSON API, methods and attributes

The communication with the Sketch Engine through the use of automated HTTP requests consists of the following steps:

Available methods and attributes

See a list of all methods and attributes available in Sketch Engine.

Authentication

If you do not have a Sketch Engine account, create either the 30-day trial subscription or a paid subscription.

To generate your API key:

  • log in to Sketch Engine
  • when logged in, click the three-dot icon at the top-right corner of the screen and select My account
  • click the Generate new API key button
    (the API key is a long string of letters and numbers)
  • copy the API key and use it as described in the API Authentication documentation

Using the API key

The API key must be sent in the “Authorization” HTTP header, as “Bearer API_KEY”. Alternatively “Basic” authentication is also supported, with your Sketch Engine username used as login and your API key as password.

Example in Python (using the “requests” library):

#!/usr/bin/env python3

import requests

USERNAME = ''  # your Sketch Engine username
API_KEY = ''   # the API key generated in My account
base_url = 'https://api.sketchengine.eu/bonito/run.cgi'
data = {
    'corpname': 'preloaded/bnc2',
    'format': 'json',
    'lemma': 'book',
    'lpos': '-v',
}
# Basic authentication: username as login, API key as password
d = requests.get(base_url + '/wsketch', params=data, auth=(USERNAME, API_KEY)).json()
print("There are %d grammar relations for %s%s (lemma+PoS) in corpus %s." %
      (len(d['Gramrels']), data['lemma'], data['lpos'], data['corpname']))
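
The example above uses Basic authentication. For the Bearer variant described earlier, a minimal sketch could build the header as follows. The helper name bearer_headers is hypothetical and not part of Sketch Engine; the resulting dictionary would be passed to requests.get(..., headers=...) instead of the auth argument.

```python
def bearer_headers(api_key):
    """Return HTTP headers carrying the API key as a Bearer token."""
    if not api_key:
        raise ValueError("API key must not be empty")
    return {"Authorization": "Bearer " + api_key}


# Placeholder key for illustration only; use the key from My account.
headers = bearer_headers("API_KEY")
print(headers)
```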

Creating query

Sketch Engine uses an HTTP REST API. Unless stated otherwise, all API methods expect HTTP GET requests.

This section describes how to create a query via the JSON API. A Sketch Engine query is a URL with the following structure:

base_url/method?attributes_and_values

where

  • base_url is the path to the main CGI script, “run.cgi”.
  • method is the particular method, e.g. “wsketch” for word sketches.
  • attributes_and_values is the list of attributes and values in the CGI notation, that is attribute_1=value_1&attribute_2=value_2& ... &attribute_n=value_n .

See the complete list of methods and attributes.

If the Sketch Engine runs on a local machine, ‘base_url’ usually starts with ‘http://localhost/’.

Our Service Level Agreement (see FUP) applies to API use, so you need to limit the frequency of API requests. This can be done with standard libraries in most programming languages, e.g. time.sleep(1) in Python.
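
As a rough sketch of such rate limiting, the class below enforces a minimum interval between consecutive requests. The name Throttle and its parameters are hypothetical and not part of Sketch Engine; the clock and sleep arguments exist only so that the behaviour can be checked without real waiting.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        """Sleep just long enough to keep min_interval between calls."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Usage: create one Throttle and call wait() before every API request.
throttle = Throttle(1.0)
throttle.wait()  # the first call returns immediately
```

Calling throttle.wait() before each request keeps requests at least one second apart, and it sleeps less when the previous request already took some time.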

A Sketch Engine query can look like this:

https://api.sketchengine.eu/bonito/run.cgi/wsketch?corpname=XXX&lemma=test&lpos=-n

XXX is to be replaced with a corpus name, e.g. preloaded/brown_1 for the BROWN corpus. The query then returns the word sketch HTML page for test as a noun (“lpos=-n”) from this corpus.
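
The URL structure described above can be assembled programmatically. The following is a minimal sketch using only the Python standard library; the function name build_query_url is hypothetical, and the corpus name is the BROWN example mentioned above.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.sketchengine.eu/bonito/run.cgi"


def build_query_url(method, attributes, base_url=BASE_URL):
    """Join base URL, method and URL-encoded attributes into a query URL."""
    # safe="/" keeps the slash in corpus names such as preloaded/brown_1
    return base_url + "/" + method + "?" + urlencode(attributes, safe="/")


url = build_query_url("wsketch", {
    "corpname": "preloaded/brown_1",
    "lemma": "test",
    "lpos": "-n",
})
print(url)
```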

Errors

If Sketch Engine cannot answer a request, it returns an error. With the JSON format, the response contains the key “error” with a message explaining what happened. The HTTP status code is changed accordingly.
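
A script can check for the “error” key before using a response. The sketch below assumes the response body has already been parsed into a Python dictionary; the names SketchEngineError and check_response are hypothetical, and the error message in the demonstration is made up for illustration.

```python
import json


class SketchEngineError(Exception):
    """Raised when a parsed JSON response contains an "error" key."""


def check_response(payload):
    """Return the payload unchanged, or raise if it reports an error."""
    if isinstance(payload, dict) and "error" in payload:
        raise SketchEngineError(payload["error"])
    return payload


# Hypothetical response bodies, for illustration only.
ok = check_response(json.loads('{"Items": []}'))
try:
    check_response(json.loads('{"error": "corpus not found"}'))
except SketchEngineError as exc:
    print("API error:", exc)
```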

JSON

Using JSON

JSON (JavaScript Object Notation, http://www.json.org/) is a lightweight data-interchange format. It is easy for humans to read and write as well as for machines to parse and generate. Sketch Engine can use JSON as the input and/or output format.

JSON input

Input in the JSON format can be passed to the Sketch Engine by the universal json attribute. All attribute names and values (including numbers and comma-delimited lists) should be encoded as JSON strings (note that quotation mark characters from the CQL queries must be escaped). Lists of attributes (e.g. by the q attribute in the view method) should be encoded as JSON arrays. Example of a complete query using JSON:

https://api.sketchengine.eu/bonito/run.cgi/view?json={"corpname":"preloaded/bnc", "q":["q[lemma=\"test\"]", "r250"]}
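
Escaping the quotation marks by hand is error-prone. A JSON library does it automatically, as in the following sketch. The function name build_json_query_url is hypothetical; the attribute values are taken from the example above.

```python
import json
from urllib.parse import urlencode

BASE_URL = "https://api.sketchengine.eu/bonito/run.cgi"


def build_json_query_url(method, attributes, base_url=BASE_URL):
    """Encode all attributes as one JSON object passed in the json attribute."""
    payload = json.dumps(attributes)  # escapes the quotes inside the CQL query
    return base_url + "/" + method + "?" + urlencode({"json": payload})


attributes = {"corpname": "preloaded/bnc", "q": ['q[lemma="test"]', "r250"]}
print(json.dumps(attributes))
print(build_json_query_url("view", attributes))
```

The first printed line shows the JSON object with the CQL quotes escaped; the second shows the same object percent-encoded for use in a URL.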

JSON output

This section describes the output of the system when the format attribute is set to json. The resulting JSON object has a fairly intuitive structure, so it is described only briefly. The output is also not described completely: some data are used only internally and describing them might be confusing. For this reason, the examples contain some fields that are not listed in the output structure and that might change over time. The output of all methods listed before is described below. Note that all structure names (JSON objects, arrays) begin with a capital letter, while atom names (strings, numbers) are always lowercase.

Note also that our API servers limit the number of queries according to our SLA. It means that sometimes, calls might be refused if minute, hour or day quotas are exceeded. In that case, HTTP 429 is sent to a client. You should react to this response and increase intervals between calls accordingly. See the Exceeding FUP limit section (below on this page).

wordlist

Structure of the ‘word list’ query result:

  • Items – list of items in the word list. One item contains:
    • str – string expression of the item (e.g. word)
    • freq – frequency of the item

Structure of the ‘keywords’ query result:

  • Keywords – list of selected keyword items. One item contains:
    • arf – the ARF value
    • cfreq – frequency in the reference (sub)corpus
    • score – item score
    • sfreq – frequency in the selected (sub)corpus
    • str – string expression of the item (e.g. word)

Example (query and result):

https://api.sketchengine.eu/bonito/run.cgi/wordlist?corpname=preloaded/bnc2;wlattr=word;wlpat=test.*;wlsort=f;wlmaxitems=2;format=json

{
   "Items": [
      {
         "freq": 11040,
         "str": "test"
      },
      {
         "freq": 4472,
         "str": "tests"
      }
   ]
}
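
A minimal sketch of reading such a result in Python follows. The function name wordlist_items is hypothetical, and the response text is the example result shown above.

```python
import json

# The example result shown above, as it would arrive in the response body.
response_text = """
{
   "Items": [
      {"freq": 11040, "str": "test"},
      {"freq": 4472, "str": "tests"}
   ]
}
"""


def wordlist_items(result):
    """Return (string, frequency) pairs from a parsed word list result."""
    return [(item["str"], item["freq"]) for item in result.get("Items", [])]


for word, freq in wordlist_items(json.loads(response_text)):
    print(word, freq)
```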

Example (query and result) – keywords:

https://api.sketchengine.eu/bonito/run.cgi/wordlist?corpname=preloaded/bnc2;wlattr=word;keywords=1;usesubcorp=wri-to-be-spoken;wlsort=f;wlmaxitems=2;ref_corpname=preloaded/bnc;format=json

{
   "Keywords": [
      {
         "arf": 5.9,
         "cfreq": 402,
         "score": 679.1,
         "sfreq": 402,
         "str": "Video-Tape"
      },
      {
         "arf": 47.2,
         "cfreq": 3765,
         "score": 679.1,
         "sfreq": 3765,
         "str": "Video-Taped"
      }
   ]
}

wsketch

Structure:

  • Gramrels – list of grammatical relations including all relevant collocates. Contains:
    • count – overall frequency of the gramrel
    • name – name of the gramrel
    • score – overall score of the gramrel
    • seek – pointer to the concordance (can be used in a w-type query in the view method)
    • Words – list of collocates in the gramrel. Each collocate contains:
      • count – frequency of the collocate in gramrel
      • score – collocate score
      • seek – collocate pointer to the concordance (can be used in a w-type query in the view method)
      • word – string expression of the collocate
      If ‘clustered collocations’ are requested, each collocate can contain information about the collocate cluster:
      • totalcount – overall frequency of the cluster (0 if the cluster is empty)
      • totalseek – cluster pointer to the concordance (can be used in a w-type query in the view method, but must be preceded by a comma (‘,’)); an empty string if the cluster is empty
      • Clust – list of words in the cluster, each word has attributes count, score, seek, word as described above. If the cluster is empty, this attribute is not included

Example (query and result):

https://api.sketchengine.eu/bonito/run.cgi/wsketch?corpname=preloaded/bnc2;lemma=test;lpos=-n;format=json

{
   "Gramrels": [
      {
         "Words": [
            {
               "Clust": [
                  {
                     "count": 32,
                     "id": 848,
                     "score": 12.63,
                     "seek": 4816731,
                     "word": "run"
                  },

                  ...

               ],
               "count": 294,
               "id": 1029,
               "score": 43.96,
               "seek": 4816743,
               "totalcount": 384,
               "totalseek": "4816743,4816731,4816760,4816700,4816806,4816675",
               "word": "pass"
            },

            ...

         ],
         "count": 3406,
         "name": "object_of",
         "score": 2.1,
         "seek": 79181
      },

    ...
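
To work with such a nested result, the grammatical relations and their collocates can be flattened into rows. This is only a sketch: the function name iter_collocates is hypothetical, and the sample data is abridged from the example above, with the elided parts left out.

```python
def iter_collocates(result):
    """Yield (gramrel name, collocate word, count, score) tuples."""
    for gramrel in result.get("Gramrels", []):
        for collocate in gramrel.get("Words", []):
            yield (gramrel["name"], collocate["word"],
                   collocate["count"], collocate["score"])


# Abridged from the example above.
sample = {
    "Gramrels": [
        {
            "name": "object_of",
            "count": 3406,
            "score": 2.1,
            "seek": 79181,
            "Words": [
                {"word": "pass", "count": 294, "score": 43.96, "seek": 4816743},
            ],
        }
    ]
}

for row in iter_collocates(sample):
    print(row)
```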

thes

Structure:

  • Words – list of similar words. Each word contains:
    • score – word score
    • word – string expression of the word
    If ‘clustered items’ are requested, each word can contain information about the word cluster:
    • Clust – list of words in the cluster, each word has attributes score, word as described above. If the cluster is empty, this attribute is not included
  • freq – frequency of the selected lemma in corpus

Example (query and result):

https://api.sketchengine.eu/bonito/run.cgi/thes?corpname=preloaded/bnc2;lemma=test;lpos=-n;maxthesitems=6;clusteritems=1;format=json

{
   "Words": [
      {
         "Clust": [
            {
               "id": 4226,
               "score": 0.223,
               "word": "examination"
            }
         ],
         "id": 941,
         "score": 0.243,
         "totalcount": 0,
         "totalseek": "",
         "word": "assessment"
      },

      ...

   ],
   "commonurl": "corpname=preloaded/bnc;lemma=test;lpos=-n",
   "freq": 15789,
   "lemma": "test",
   "lpos": "-n"
}

wsdiff

This method does not currently support JSON output.

view

Structure:

  • Lines – list of concordance lines. Each line contains:
    • Kwic – list of KWIC segments (segment stands for one or more tokens). Each segment contains:
      • class – class name of the segment (e.g. ‘attr’ = attribute, ‘coll’ = collocation etc.)
      • str – string expression of the segment (attributes are preceded by ‘/’ for correct display on the HTML page)
    • Left – list of left context segments (same structure as Kwic)
    • Right – list of right context segments (same structure as Kwic)
    • ref – line reference (‘reference’ field content)
    • toknum – token number (of the first token in KWIC)
  • concsize – number of lines in concordance (or number of hits)
  • numofpages – number of pages in concordance

Example (query and result):

https://api.sketchengine.eu/bonito/run.cgi/view?corpname=preloaded/bnc2;q=q[lemma="drug"][lemma="test"];pagesize=2;ctxattrs=word,tag;format=json

{
   "Lines": [
      {
         "Align": [],
         "Kwic": [
            {
               "class": "col0 coll",
               "str": " drug test"
            }
         ],
         "Left": [
            {
               "class": "attr",
               "str": "/VM0"
            },
            {
               "class": "",
               "str": " be"
            },

            ...

         ],
         "Right": [
            {
               "class": "",
               "str": " at"
            },

            ...

         ],
         "hitlen": ";hitlen=2",
         "leftspace": "",
         "linegroup": "_",
         "ref": "A0M",
         "toknum": 654026
      },

      ...

   ],
   "concsize": 70,
   "fromp": 1,
   "lastlink": "fromp=35",
   "nextlink": "fromp=2",
   "numofpages": 35
}
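
A concordance line can be turned into plain text by joining its segments. The sketch below skips segments whose class is ‘attr’, since these carry attribute values such as tags rather than text; the function names are hypothetical, and the sample line is abridged from the example above.

```python
def segments_text(segments, skip_classes=("attr",)):
    """Concatenate segment strings, skipping attribute segments."""
    parts = []
    for seg in segments:
        classes = seg.get("class", "").split()
        if any(c in skip_classes for c in classes):
            continue
        parts.append(seg["str"])
    return "".join(parts).strip()


def kwic_line(line):
    """Return (left context, KWIC, right context) as plain strings."""
    return (segments_text(line.get("Left", [])),
            segments_text(line.get("Kwic", [])),
            segments_text(line.get("Right", [])))


# Abridged from the example above.
sample_line = {
    "Kwic": [{"class": "col0 coll", "str": " drug test"}],
    "Left": [{"class": "attr", "str": "/VM0"}, {"class": "", "str": " be"}],
    "Right": [{"class": "", "str": " at"}],
    "ref": "A0M",
    "toknum": 654026,
}
print(kwic_line(sample_line))
```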

freqs

Structure:

  • Blocks – list of frequency blocks (tables). Each table contains:
    • Head – list of the table headings. Each heading contains:
      • n – string representation of the heading (name of the column)
      • s – ID of the column, can be used as a value of the freq_sort attribute
    • Items – list of lines in the table. Each line contains:
      • Word – list of items in the left part of the table (i.e. all columns except ‘Freq’ and “Rel[%]” column). Each item contains:
        • n – string representation of the item
      • freq – frequency (content of the “Freq” column)
      • rel – content of the “Rel[%]” column. If the column is not present, this attribute is not included

Example (query and result):

https://api.sketchengine.eu/bonito/run.cgi/freqs?q=q[lemma="test"];corpname=preloaded/bnc2;fcrit=word/+0+lemma/+0+tag/+0;flimit=3000;ml=1;format=json

{
   "Blocks": [
      {
         "Head": [
            {
               "n": "word",
               "s": 0
            },
            {
               "n": "lemma",
               "s": 1
            },
            {
               "n": "tag",
               "s": 2
            },
            {
               "n": "Freq",
               "s": "freq"
            }
         ],
         "Items": [
            {
               "Word": [
                  {
                     "n": "test"
                  },
                  {
                     "n": "test"
                  },
                  {
                     "n": "NN1"
                  }
               ],
               "fbar": 301,
               "freq": 8609,
               "norel": 1
            },

            ...

collx

Structure:

  • Head – list of table headings. Each heading contains:
    • n – name of the column. Can be empty.
    • s – column ID. Can be used as a value of the csortfn attribute. If n is empty, this is not included
  • Items – list of table lines. Each line contains:
    • Stats – list of the statistics in the line (in the same order as in the heading). Each statistic contains:
      • n – value itself (content of the column)
    • freq – collocation frequency
    • str – string expression of the collocate

Example (query and result):

https://api.sketchengine.eu/bonito/run.cgi/collx?q=q[lemma="test"];corpname=preloaded/bnc2;csortfn=m

{
   "Head": [
      {
         "n": ""
      },
      {
         "n": "Freq",
         "s": "f"
      },
      {
         "n": "T-score",
         "s": "t"
      },
      {
         "n": "MI",
         "s": "m"
      }
   ],
   "Items": [
      {
         "Stats": [
            {
               "s": "2.828"
            },
            {
               "s": "12.938"
            }
         ],
         "freq": 8,
         "nfilter": "q=n-5 5 1 [word="Belvin"]",
         "pfilter": "q=p-5 5 1 [word="Belvin"]",
         "str": "Belvin"
      },

      ...

save* methods

These methods return the same output as their mother methods (see above). Using them for JSON output is deprecated.

subcorp

Structure:

  • SubcorpList – list of available subcorpora. Each subcorpus contains:
    • n – name of the subcorpus

Fields available only if a new subcorpus is created:

  • corpsize – size of the mother corpus (number of tokens)
  • subcsize – size of the created subcorpus (number of tokens)

Example (query and result):

https://api.sketchengine.eu/bonito/run.cgi/subcorp?corpname=preloaded/bnc2;format=json

{
   "SubcorpList": [
      {
         "n": "book"
      },
      {
         "n": "wri-to-be-spoken"
      }
   ]
}

Corpus creation

The examples are in Python. Stitching the code snippets together and replacing the placeholders will produce functional code. This API is not versioned and is subject to development.

To create a new corpus from your own files via the API, follow these steps:

  1. authenticate yourself
  2. create a new corpus for a given language
  3. upload files and
  4. wait for processing.

When finished, the corpus can be accessed via the API as usual (see what you can do). The available queries will depend on the language and the content. A few Python modules and your API key are required. Copy the API key from the Sketch Engine interface by navigating to My account via the icon in the top right of the interface.

#!/usr/bin/python

import json
import requests
import time

auth = ('%username%', '%api_key%')
URL = 'https://api.sketchengine.eu/ca/api'

Set the language of the corpus (the example uses English) and the corpus name.

r = requests.post(URL + '/corpora', auth=auth, json={
    'language_id': 'en',
    'name': 'api_test'
})

Use ISO 639-1 language codes. The API provides a list of all languages supported by Sketch Engine.

All responses are in JSON. Future calls to the corpus building API will require the numeric corpus ID and the corpus querying API will require the textual corpname.

corpus_id = r.json()['data']['id']
corpus_url = URL + '/corpora/' + str(corpus_id)
corpname = r.json()['data']['corpname']

To upload files, their names, actual content and MIME type are required.

files = {'file': ('testing.txt', open('/path/to/your/file/testing.txt', 'rb'), 'text/plain')}
r = requests.post(corpus_url + '/documents', auth=auth, files=files, params={'feeling': 'lucky'})

When files are sent to the corpus, they are processed automatically.  It is necessary to wait until the processing is done before compiling the corpus. Check the status of the corpus periodically:

while True:
    time.sleep(5)
    r = requests.post(corpus_url + '/can_be_compiled', json={}, auth=auth)
    if r.json()['result']['can_be_compiled']:
        break

Once the files are converted and tagged, the above call will return True.

It is possible to keep uploading additional files even if the processing of the previously uploaded files has not finished. Uploading too many files too quickly may cause the FUP limit to be reached. The call will only return True after all the files have been processed.

Compilation is required to query the corpus later. This is how to start the compilation and check its status:

r = requests.post(corpus_url + '/compile', json={'structures': 'all'}, auth=auth)
while True:
    time.sleep(5)
    r = requests.get(corpus_url + '/get_progress', json={}, auth=auth)
    progress = r.json()['result']['progress']
    if progress < 1 or progress > 99:
        break

Progress 100 means that the compilation finished successfully and the corpus is ready for querying. Use the corpname attribute as the identifier for corpus querying.

Progress -1 means that the compilation failed and the error message can be found in the result.
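
The progress values described above can be mapped to readable states, for instance to decide what to do after the polling loop ends. This is a sketch with a hypothetical function name; treating every value other than 100 and -1 as still running is an assumption.

```python
def compilation_state(progress):
    """Map the progress value from get_progress to a readable state."""
    if progress == -1:
        return "failed"   # the error message can be found in the result
    if progress == 100:
        return "done"     # the corpus is ready for querying
    return "running"      # assumption: any other value means still compiling


for value in (42, 100, -1):
    print(value, compilation_state(value))
```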

If you have any questions or need to report a problem, contact us at support@sketchengine.eu.

API examples

The following example uses the cURL command-line tool to query the BNC corpus for a word list. A blacklist of words to be filtered out of the results can be sent to the server along with the query. The example uses the API key, which you get in My account (three-dot icon at the top-right corner of the screen).

#!/bin/bash

cat >bl.wl << EOF
and
the
of
in
on
at
a
an
to
that
is
EOF
curl -F "corpname=preloaded/bnc2_tt21" -F "wlsort=f" -F "wlattr=word" -F "format=json" -H "Content-Type: multipart/form-data" -F "wlblacklist=@bl.wl" --user "USERNAME:APIKEY" "https://api.sketchengine.eu/bonito/run.cgi/wordlist" > result.json

cat result.json

Discrepancies between API and interface results

When you query a corpus in the web interface, you may notice that the result appears very quickly even for quite large corpora. This is possible only because the query is processed asynchronously: you instantly see only part of the result while the rest is computed in the background. This is indicated by the growing number of hits on the result page. Once the query is fully processed, the counting stops. When you query a corpus via the API, you usually do not want Sketch Engine to behave like this, so you can disable it by adding the parameter async=0 to the URL. See the documentation of the view method.
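
In Python, note that async is a reserved word, so the parameter can only be supplied as a dictionary key and not as a keyword argument. A minimal sketch follows, with a hypothetical helper name and example attribute values taken from the view example earlier on this page:

```python
from urllib.parse import urlencode


def synchronous(params):
    """Return a copy of params that asks for the fully computed result."""
    result = dict(params)  # leave the caller's dictionary untouched
    result["async"] = "0"
    return result


params = synchronous({
    "corpname": "preloaded/bnc2",
    "q": 'q[lemma="test"]',
    "format": "json",
})
print(urlencode(params, safe="/"))
```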

WARNING!

Exceeding the limit of API requests

Our API servers apply the FUP (Fair Usage Policy), which is defined in our Service Level Agreement. If you exceed the quotas, our server responds with the error HTTP 429 Too Many Requests. A request for JSON output is ignored in this case, and an HTML response is returned together with the HTTP error. Your API script should detect this situation and stop, since all further requests will end with HTTP 429. It is advisable to increase the interval between queries.
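
One possible way to stop a script on HTTP 429 is sketched below. The names QuotaExceeded and run_queries are hypothetical. The fetch argument is assumed to be a function supplied by you that sends one request and returns an object with a status_code attribute and a json() method, such as a requests.Response.

```python
class QuotaExceeded(Exception):
    """Raised when the server answers with HTTP 429."""

    def __init__(self, query, results):
        super().__init__("HTTP 429 received for query %r" % (query,))
        self.results = results  # results collected before the quota was hit


def run_queries(fetch, queries):
    """Call fetch(query) for each query; stop at the first HTTP 429."""
    results = []
    for query in queries:
        response = fetch(query)
        if response.status_code == 429:
            # The body is HTML in this case, so it is not parsed as JSON.
            raise QuotaExceeded(query, results)
        results.append(response.json())
    return results
```

The results gathered before the limit was reached remain available on the exception, so a script can save them and resume later with a longer interval.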

The right interval depends on how many requests you make, but a simple rule might be:

  • if you want to make fewer than 50 requests, you don’t need to wait between them,
  • if you want to make up to 900 requests, you need to use an interval of 4 seconds per query,
  • if you want to make more than 2000 requests, you need to use an interval of ca. 44 seconds.

If the queries take some time, you may decrease the interval.

If you have exceeded the limit of API requests (defined in the Service Level Agreement), you can use our testing account with the following details:

  • login: api_testing
  • api_key: YNSC0B9OXN57XB48T9HWUFFLPY4TZ6OE