Sketch Engine JSON API, methods and attributes

Communicating with the Sketch Engine through automated HTTP requests involves authentication, creating a query, and working with JSON input and output. Each of these steps is described in its own section below.

Authentication

Authentication is optional and can be omitted in local installations. In that case, simple requests can be issued to Sketch Engine directly. Servers, however, usually require some form of user authentication. Local installations typically use Basic HTTP authentication, while our servers authenticate users via Corpus Architect.

No authentication

Minimalistic non-authenticated API request on the local computer:

Example in Java:

import java.net.*;
import java.io.*;

public class GetURL {
    public static void main(String[] args) throws Exception {
         // url with the query
        String url_string = "http://localhost/run.cgi/wordlist?corpname=bnc;wlattr=word;wlminfreq=5;wlmaxitems=100;wlpat=test.*;format=json";

        // connecting the SketchEngine Server
        URL url = new URL(url_string);
        InputStream stream = url.openStream();
        InputStreamReader isr = new InputStreamReader(stream);
        BufferedReader reader = new BufferedReader(isr);
        
        try {
            Thread.sleep(10000);
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
        }

        // data receiving
        System.out.println(reader.readLine()); // json data are on the first line
    }
}

Example in Python 2.7:

import urllib2
import time

url = "http://localhost/run.cgi/wordlist?corpname=bnc;wlattr=word;wlminfreq=5;wlmaxitems=100;wlpat=test.*;format=json"
request = urllib2.Request(url)

# data receiving
file = urllib2.urlopen(request)
data = file.read()
file.close()
time.sleep(10)

print data

Basic http authentication

This is a common variant on local installations, but it is not compatible with the api.sketchengine.co.uk server.

Sample in Java code: (download the full code)

import java.net.Authenticator;
import java.net.PasswordAuthentication;

...
        final String usr = "";
        final String passwd = "";
        
            // authentication issues
        Authenticator auth = new Authenticator() {
            protected PasswordAuthentication  getPasswordAuthentication () {
                return new PasswordAuthentication(usr, passwd.toCharArray());
            }
        };
        Authenticator.setDefault(auth);
...

Example in Python 2.7: (download)

#!/usr/bin/env python

# a test for demonstration using Sketch Engine through json interface

import urllib, urllib2, base64, json

usr = ''
passwd = ''

base_url = 'http://localhost/auth/run.cgi/'
method = 'view'

# creating query string
attrs = dict(corpname='bnc2', q='', pagesize='1', format='json')
# query_list can be read from a file, ...
query_list = ['[lemma="test"]',
              '[lemma="drug"][lemma="test"]',
              '[lemma="blood"][lemma="test"]',
              '[lemma="test"][lemma="result"]'
             ]

for query in query_list:
    attrs['q'] = 'q' + query

    encoded_attrs = urllib.quote(json.dumps(attrs))
    url = base_url + method + '?json=%s' % encoded_attrs

    request = urllib2.Request(url)

    # authentication
    base64string = base64.encodestring('%s:%s' % (usr, passwd))[:-1]
    request.add_header("Authorization", "Basic %s" % base64string)

    # json data receiving
    file = urllib2.urlopen(request)
    data = file.read()
    file.close()

    # now, in the 'data' variable, there is a json string that can be parsed
    # for json syntax
    json_obj = json.loads(data)

    print query + '\t' + str(json_obj.get('concsize', '0'))

Example in R: (download)

library(RCurl)
# build a URL
result <- getURL("URL", userpwd="USERNAME:PASSWORD", httpauth = 1L)
Sys.sleep(10)
# do something with the result
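
For readers on Python 3, the Authorization header from the Python 2.7 example above can be built as in the following sketch. It assumes Python 3 and uses only the standard library; the helper names basic_auth_header and make_request are hypothetical and not part of any Sketch Engine package. The request is only constructed, not sent.

```python
import base64
import urllib.request


def basic_auth_header(usr, passwd):
    """Return the value of the HTTP Authorization header for Basic auth."""
    token = base64.b64encode(("%s:%s" % (usr, passwd)).encode("utf-8"))
    return "Basic " + token.decode("ascii")


def make_request(url, usr, passwd):
    """Build (but do not send) a request carrying the Basic auth header."""
    request = urllib.request.Request(url)
    request.add_header("Authorization", basic_auth_header(usr, passwd))
    return request
```

Passing the returned object to urllib.request.urlopen would then send the authenticated request.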

API key authentication

Example in Python (using module requests):

#!/usr/bin/env python

import requests
base_url = 'https://api.sketchengine.co.uk/bonito/run.cgi'
data = {
    'corpname': 'bnc2',
    'format': 'json',
    'lemma': 'book',
    'lpos': '-v',
    'username': '',
    'api_key': ''
    # get it here: https://app.sketchengine.co.uk/auth/api_access/
}
d = requests.get(base_url + '/wsketch', params=data).json()
print "There are %d grammar relations for %s%s (lemma+PoS) in corpus %s." % \
        (len(d['Gramrels']), data['lemma'], data['lpos'], data['corpname'])

Example in R:

# install.packages('RCurl')
library(RCurl)

# get your API key here: https://app.sketchengine.co.uk/auth/api_access/
url_prefix = "https://api.sketchengine.co.uk/corpus/thes?corpname=bnc2;lemma=mother;lpos=;"
parameters = "format=json;api_key=;username="
url <- paste(url_prefix, parameters, sep="")
result <- getURL(url)
print(result)
gc()

Corpus Architect authentication

This is the authentication method used on our servers (http://old.sketchengine.co.uk and http://api.sketchengine.co.uk).

Sample in Java code (download the full code). It requires non-standard libraries, which can be downloaded from the example bookmark.

import org.apache.commons.httpclient.*;
import org.apache.commons.httpclient.cookie.CookiePolicy;
import org.apache.commons.httpclient.methods.*;
import org.apache.commons.httpclient.protocol.*;
import org.apache.commons.httpclient.contrib.ssl.*;
...
class example3_ca {

    static final String root_url = "api.sketchengine.co.uk";
    static final String ske_username = "";
    static final String ske_password = "";
   
    public static void main(String[] args) {
        
        String corp = "bnc";
        String method = "view";
        String base_url = "/bonito/run.cgi/";

        ...

        // make HTTPS connection
        HttpClient client = new HttpClient();
        try {
          Protocol.registerProtocol("https", new Protocol("https", (ProtocolSocketFactory)new EasySSLProtocolSocketFactory(), 443));
          //client.getHostConfiguration().setHost(root_url, 80, "http");
          client.getHostConfiguration().setHost(root_url, 443, "https");
          client.getParams().setCookiePolicy(CookiePolicy.BROWSER_COMPATIBILITY);
        } catch (java.security.GeneralSecurityException e){
          e.printStackTrace();
        } catch (IOException e){
          e.printStackTrace();
        }
        client.getParams().setCookiePolicy(CookiePolicy.BROWSER_COMPATIBILITY);
       
        // retrieve session id
        GetMethod authget = new GetMethod("/login/");
        try {
            int code=client.executeMethod(authget);
        } catch (IOException ex) {
            System.err.println("Error: couldn't retrieve session ID from Sketch Engine server.");
            System.exit(1);
        }
        authget.releaseConnection();
       
        // login   
        PostMethod authpost = new PostMethod("/login/");
        NameValuePair submit   = new NameValuePair("submit", "ok");
        NameValuePair username = new NameValuePair("username", ske_username);
        NameValuePair password = new NameValuePair("password", ske_password);
        authpost.setRequestBody(new NameValuePair[] {submit, username, password});
           try {
             int code=client.executeMethod(authpost);
        } catch (IOException ex) {
            System.err.println("Error: couldn't login to Sketch Engine server.");
            System.exit(2);
        }
        authpost.releaseConnection();

        try {
            Thread.sleep(10000);
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
        }

        // retrieve data
...

Example in Python 2.7: (download)

#!/usr/bin/env python

import urllib, urllib2, cookielib
import simplejson
import Cookie

username = ''
password = ''
corp = 'bnc2'

root_url = 'https://api.sketchengine.co.uk'

base_url = '%s/bonito/run.cgi/' % root_url
method = 'view'

# creating query string
attrs = dict(corpname=corp, q='', pagesize='1', format='json')
# query_list can be read from a file, ...
query_list = ['[lemma="test"]',
              '[lemma="drug"][lemma="test"]',
              '[lemma="blood"][lemma="test"]',
              '[lemma="test"][lemma="result"]'
             ]

# authentication
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
login_data = urllib.urlencode({ 'username' : username,
                                'password' : password,
                                'submit' : 'ok',
                              })
data = opener.open('%s/login/' % root_url)
data = opener.open('%s/login/' % root_url, login_data)

for query in query_list:
    attrs['q'] = ['q' + query, 'r100']
    encoded_attrs = urllib.quote(simplejson.JSONEncoder().encode(attrs))
    url = base_url + method + '?json=%s' % encoded_attrs
    # json data receiving
    file = opener.open(url)
    data = file.read()
    file.close()

    # now, in the 'data' variable, there is a json string that can be parsed
    # for json syntax (e.g. by simplejson)
    json_obj = simplejson.loads(data)

    print query + '\t' + str(json_obj.get('concsize', '0'))

Example in R:

library(RCurl)
library(rjson)

loginurl = "https://api.sketchengine.co.uk/login/"
dataurl = "https://api.sketchengine.co.uk/corpus/view?q=alc,[lemma=\"book\"];corpname=bnc2;format=json"

# authentication parameters
pars=list(
    username="USERNAME",
    password="PASSWORD"
)

# setup curl
agent="Mozilla/5.0"
curl = getCurlHandle()
curlSetOpt(cookiejar="cookies.txt", useragent=agent, followlocation=TRUE, curl=curl)

# authenticate with login form
postForm(loginurl, .params = pars, curl=curl)

# access the requested URL
html=getURL(dataurl, curl=curl)
Sys.sleep(10)

# parse JSON result
document <- fromJSON(html, method='C')

# work with the object
show(document)

# clean up
rm(curl)
gc()

Example in Bash

# authenticate with cookies
wget --save-cookies ca_cookies.txt \
     --post-data 'username=USERNAME&password=PASSWORD' \
     https://app.sketchengine.co.uk/login/ \
     -O /dev/null

# call the URL
wget --load-cookies ca_cookies.txt \
     "https://app.sketchengine.co.uk/bonito/run.cgi/wordlist?corpname=preloaded%2Fbnc2&wlattr=word&wlpat=%5Epro%2E%2A&format=json" \
     -O result.json
sleep 10

# work with the result
cat result.json

Creating query

Sketch Engine uses HTTP REST API. All API methods (unless stated otherwise) expect GET HTTP requests.

This section describes how to create a query for the JSON API. A Sketch Engine query is a URL of the following structure:

base_url/method?attributes_and_values

where

  • base_url is the path to the main CGI script, “run.cgi”.
  • method is the particular method, e.g. “wsketch” for word sketches.
  • attributes_and_values is the list of attributes and values in the CGI notation, that is attribute_1=value_1&attribute_2=value_2& ... &attribute_n=value_n .

See the complete list of methods and attributes.

If the Sketch Engine runs on a local machine, ‘base_url’ usually starts with ‘http://localhost/’.

Because our Service Level Agreement (see FUP) applies, you need to limit the frequency of API requests. This can be done with standard libraries in various programming languages, e.g. time.sleep(1) in Python.

An example of Sketch Engine query can look like this:

https://api.sketchengine.co.uk/bonito/run.cgi/wsketch?corpname=XXX&lemma=test&lpos=-n

Replace XXX with a corpus name, e.g. preloaded/brown_1 for the BROWN corpus. The query then returns the word sketch HTML page for test as a noun (“lpos=-n”) from this corpus.
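
The URL structure described above can be assembled programmatically. The following is a minimal sketch, assuming Python 3; the helper name build_query_url is hypothetical. Note that urlencode percent-encodes the slash in the corpus name (preloaded%2Fbrown_1), the same form that appears in the Bash example in the Authentication section.

```python
from urllib.parse import urlencode


def build_query_url(base_url, method, attrs):
    """Join base_url, method and CGI attributes into one query URL."""
    return "%s/%s?%s" % (base_url.rstrip("/"), method, urlencode(attrs))


url = build_query_url(
    "https://api.sketchengine.co.uk/bonito/run.cgi",
    "wsketch",
    {"corpname": "preloaded/brown_1", "lemma": "test", "lpos": "-n"},
)
print(url)
```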

Errors

If Sketch Engine cannot answer a request, it returns an error. For the JSON format, the response contains the key “error” with a message explaining what happened. The HTTP status code is set accordingly.
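
A client can check for the “error” key before using a response. The sketch below assumes Python 3; the exception class SketchEngineError and the function parse_response are hypothetical names chosen for illustration, and the error message in the usage note is invented.

```python
import json


class SketchEngineError(Exception):
    """Raised when a JSON response carries an "error" key (hypothetical name)."""


def parse_response(body):
    """Decode a JSON response body and raise if it reports an error."""
    data = json.loads(body)
    if isinstance(data, dict) and "error" in data:
        raise SketchEngineError(data["error"])
    return data
```

For example, parse_response('{"concsize": 70}') returns the decoded dictionary, while a body such as '{"error": "some message"}' raises SketchEngineError with that message.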

JSON

Using JSON

JSON (JavaScript Object Notation, http://www.json.org/) is a lightweight data-interchange format. It is easy for humans to read and write as well as for machines to parse and generate. The Sketch Engine offers a possibility of using the JSON format as the input and/or output format.

JSON input

Input in the JSON format can be passed to the Sketch Engine by the universal json attribute. All attribute names and values (including numbers and comma-delimited lists) should be encoded as JSON strings (note that quotation mark characters from the CQL queries must be escaped). Lists of attributes (e.g. by the q attribute in the view method) should be encoded as JSON arrays. Example of a complete query using JSON:

https://api.sketchengine.co.uk/bonito/run.cgi/view?json={"corpname":"preloaded/bnc", "q":["q[lemma=\"test\"]", "r250"]}
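
Letting a JSON library do the encoding avoids escaping the quotation marks by hand. This sketch assumes Python 3 and mirrors the quote(json.dumps(attrs)) approach of the Python 2.7 authentication examples; the helper name build_json_query_url is hypothetical.

```python
import json
from urllib.parse import quote, unquote


def build_json_query_url(base_url, method, attrs):
    """Encode attrs as a JSON object and pass it via the json attribute."""
    encoded = quote(json.dumps(attrs))
    return "%s/%s?json=%s" % (base_url.rstrip("/"), method, encoded)


# The list under "q" becomes a JSON array; the CQL quotes are escaped by json.dumps.
attrs = {"corpname": "preloaded/bnc", "q": ['q[lemma="test"]', "r250"]}
url = build_json_query_url(
    "https://api.sketchengine.co.uk/bonito/run.cgi", "view", attrs
)
print(unquote(url))
```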

JSON output

This section describes the output of the system when the format attribute is set to json. The structure of the resulting JSON object is fairly intuitive, so it is described only briefly. The description is also deliberately incomplete: some data are used only internally and describing them might be confusing, so the examples contain some fields that are not listed in the output structure and that might change over time. The following subsections describe the output of all methods listed before. Note also that all structure names (JSON objects, arrays) begin with a capital letter, while atom names (strings, numbers) are always lowercase.

Note also that our API servers limit the number of queries according to our SLA. It means that sometimes, calls might be refused if minute, hour or day quotas are exceeded. In that case, HTTP 429 is sent to a client. You should react to this response and increase intervals between calls accordingly. See the Exceeding FUP limit section (below on this page).
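
One way to react to HTTP 429 is to wait and retry with growing intervals. The sketch below assumes Python 3; the function name call_with_backoff, the doubling strategy and the default values are assumptions made for illustration, not a documented Sketch Engine policy. The fetch argument is any callable returning a (status_code, body) pair, so the helper works with whichever HTTP library you use.

```python
import time


def call_with_backoff(fetch, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it stops returning HTTP 429.

    The wait doubles after every refused call:
    base_delay, 2 * base_delay, 4 * base_delay, ...
    """
    delay = base_delay
    for attempt in range(max_attempts):
        status, body = fetch()
        if status != 429:
            return status, body
        if attempt < max_attempts - 1:
            sleep(delay)
            delay *= 2
    raise RuntimeError("quota still exceeded after %d attempts" % max_attempts)
```

The sleep parameter exists only so the waiting can be replaced in tests; in normal use the default time.sleep applies.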

wordlist

Structure of the ‘word list’ query result:

  • Items – list of items in the word list. One item contains:
    • str – string expression of the item (e.g. word)
    • freq – frequency of the item

Structure of the ‘keywords’ query result:

  • Keywords – list of selected keyword items. One item contains:
    • arf – the ARF value
    • cfreq – frequency in the reference (sub)corpus
    • score – item score
    • sfreq – frequency in the selected (sub)corpus
    • str – string expression of the item (e.g. word)

Example (query and result):

https://api.sketchengine.co.uk/bonito/run.cgi/wordlist?corpname=preloaded/bnc;wlattr=word;wlpat=test.*;wlsort=f;wlmaxitems=2;format=json

{
   "Items": [
      {
         "freq": 11040,
         "str": "test"
      },
      {
         "freq": 4472,
         "str": "tests"
      }
   ]
}

Example (query and result) – keywords:

https://api.sketchengine.co.uk/bonito/run.cgi/wordlist?corpname=preloaded/bnc;wlattr=word;keywords=1;usesubcorp=wri-to-be-spoken;wlsort=f;wlmaxitems=2;ref_corpname=preloaded/bnc;format=json

{
   "Keywords": [
      {
         "arf": 5.9,
         "cfreq": 402,
         "score": 679.1,
         "sfreq": 402,
         "str": "Video-Tape"
      },
      {
         "arf": 47.2,
         "cfreq": 3765,
         "score": 679.1,
         "sfreq": 3765,
         "str": "Video-Taped"
      }
   ]
}
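
As a small illustration of consuming the word list result, the following sketch (assuming Python 3; the function name wordlist_frequencies is hypothetical) turns the Items list from the first example above into a dictionary.

```python
import json

# Response body copied from the word list example above.
sample = '{"Items": [{"freq": 11040, "str": "test"}, {"freq": 4472, "str": "tests"}]}'


def wordlist_frequencies(body):
    """Map each word list item string to its frequency."""
    data = json.loads(body)
    return {item["str"]: item["freq"] for item in data.get("Items", [])}


freqs = wordlist_frequencies(sample)
print(freqs["test"])
```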

wsketch

Structure:

  • Gramrels – list of grammatical relations including all relevant collocates. Contains:
    • count – overall frequency of the gramrel
    • name – name of the gramrel
    • score – overall score of the gramrel
    • seek – pointer to the concordance (can be used in a w-type query in the view method)
    • Words – list of collocates in the gramrel. Each collocate contains:
      • count – frequency of the collocate in gramrel
      • score – collocate score
      • seek – collocate pointer to the concordance (can be used in a w-type query in the view method)
      • word – string expression of the collocate
      If ‘clustered collocations’ are requested, each collocate can also contain information about the collocate cluster:
      • totalcount – overall frequency of the cluster (0 if the cluster is empty)
      • totalseek – cluster pointer to the concordance (can be used in a w-type query in the view method, but must be preceded by a comma (‘,’)); an empty string (‘’) if the cluster is empty
      • Clust – list of words in the cluster, each word has attributes count, score, seek, word as described above. If the cluster is empty, this attribute is not included

Example (query and result):

https://api.sketchengine.co.uk/bonito/run.cgi/wsketch?corpname=preloaded/bnc;lemma=test;lpos=-n;format=json

{
   "Gramrels": [
      {
         "Words": [
            {
               "Clust": [
                  {
                     "count": 32,
                     "id": 848,
                     "score": 12.63,
                     "seek": 4816731,
                     "word": "run"
                  },

                  ...

               ],
               "count": 294,
               "id": 1029,
               "score": 43.96,
               "seek": 4816743,
               "totalcount": 384,
               "totalseek": "4816743,4816731,4816760,4816700,4816806,4816675",
               "word": "pass"
            },

            ...

         ],
         "count": 3406,
         "name": "object_of",
         "score": 2.1,
         "seek": 79181
      },

    ...
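
The nested structure can be walked as in the following sketch, which assumes Python 3. The function name collocates is hypothetical, and the sample is an abbreviated copy of the example above with the elided parts left out.

```python
# Abbreviated from the wsketch example above (elided entries omitted).
sample = {
    "Gramrels": [
        {
            "name": "object_of",
            "count": 3406,
            "Words": [
                {
                    "word": "pass",
                    "count": 294,
                    "score": 43.96,
                    "Clust": [{"word": "run", "count": 32, "score": 12.63}],
                }
            ],
        }
    ]
}


def collocates(wsketch):
    """Yield (gramrel name, collocate, cluster members) triples."""
    for gramrel in wsketch.get("Gramrels", []):
        for word in gramrel.get("Words", []):
            # "Clust" is not included when the cluster is empty.
            cluster = [c["word"] for c in word.get("Clust", [])]
            yield gramrel["name"], word["word"], cluster


for name, word, cluster in collocates(sample):
    print(name, word, cluster)
```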

thes

Structure:

  • Words – list of similar words. Each word contains:
    • score – word score
    • word – string expression of the word
    If ‘clustered items’ are requested, each word can also contain information about the word cluster:
    • Clust – list of words in the cluster, each word has attributes score, word as described above. If the cluster is empty, this attribute is not included
  • freq – frequency of the selected lemma in corpus

Example (query and result):

https://api.sketchengine.co.uk/bonito/run.cgi/thes?corpname=preloaded/bnc;lemma=test;lpos=-n;maxthesitems=6;clusteritems=1;format=json

{
   "Words": [
      {
         "Clust": [
            {
               "id": 4226,
               "score": 0.223,
               "word": "examination"
            }
         ],
         "id": 941,
         "score": 0.243,
         "totalcount": 0,
         "totalseek": "",
         "word": "assessment"
      },

      ...

   ],
   "commonurl": "corpname=preloaded/bnc;lemma=test;lpos=-n",
   "freq": 15789,
   "lemma": "test",
   "lpos": "-n"
}

wsdiff

This method does not currently support JSON output.

view

Structure:

  • Lines – list of concordance lines. Each line contains:
    • Kwic – list of KWIC segments (segment stands for one or more tokens). Each segment contains:
      • class – class name of the segment (e.g. ‘attr’ = attribute, ‘coll’ = collocation etc.)
      • str – string expression of the segment (attributes are preceded by ‘/’ for correct display on the HTML page)
    • Left – list of left context segments (same structure as Kwic)
    • Right – list of right context segments (same structure as Kwic)
    • ref – line reference (‘reference’ field content)
    • toknum – token number (of the first token in KWIC)
  • concsize – number of lines in concordance (or number of hits)
  • numofpages – number of pages in concordance

Example (query and result):

https://api.sketchengine.co.uk/bonito/run.cgi/view?corpname=preloaded/bnc;q=q[lemma="drug"][lemma="test"];pagesize=2;ctxattrs=word,tag;format=json

{
   "Lines": [
      {
         "Align": [],
         "Kwic": [
            {
               "class": "col0 coll",
               "str": " drug test"
            }
         ],
         "Left": [
            {
               "class": "attr",
               "str": "/VM0"
            },
            {
               "class": "",
               "str": " be"
            },

            ...

         ],
         "Right": [
            {
               "class": "",
               "str": " at"
            },

            ...

         ],
         "hitlen": ";hitlen=2",
         "leftspace": "",
         "linegroup": "_",
         "ref": "A0M",
         "toknum": 654026
      },

      ...

   ],
   "concsize": 70,
   "fromp": 1,
   "lastlink": "fromp=35",
   "nextlink": "fromp=2",
   "numofpages": 35
}
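
To turn a concordance line back into plain text, the Left, Kwic and Right segments can be joined while skipping attribute segments. This is a sketch assuming Python 3; the function name line_text is hypothetical and the sample line is abbreviated from the example above.

```python
def line_text(line):
    """Join the Left, Kwic and Right segments of one concordance line,
    skipping attribute segments (class "attr")."""
    parts = []
    for side in ("Left", "Kwic", "Right"):
        for seg in line.get(side, []):
            if "attr" in seg.get("class", "").split():
                continue
            parts.append(seg["str"])
    return "".join(parts).strip()


# Abbreviated from the view example above (elided segments omitted).
sample_line = {
    "Left": [{"class": "attr", "str": "/VM0"}, {"class": "", "str": " be"}],
    "Kwic": [{"class": "col0 coll", "str": " drug test"}],
    "Right": [{"class": "", "str": " at"}],
    "ref": "A0M",
    "toknum": 654026,
}
print(line_text(sample_line))
```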

freqs

Structure:

  • Blocks – list of frequency blocks (tables). Each table contains:
    • Head – list of the table headings. Each heading contains:
      • n – string representation of the heading (name of the column)
      • s – ID of the column, can be used as a value of the freq_sort attribute
    • Items – list of lines in the table. Each line contains:
      • Word – list of items in the left part of the table (i.e. all columns except ‘Freq’ and “Rel[%]” column). Each item contains:
        • n – string representation of the item
      • freq – frequency (content of the “Freq” column)
      • rel – content of the “Rel[%]” column. If the column is not present, this attribute is not included

Example (query and result):

https://api.sketchengine.co.uk/bonito/run.cgi/freqs?q=q[lemma="test"];corpname=preloaded/bnc;fcrit=word/+0+lemma/+0+tag/+0;flimit=3000;ml=1;format=json

{
   "Blocks": [
      {
         "Head": [
            {
               "n": "word",
               "s": 0
            },
            {
               "n": "lemma",
               "s": 1
            },
            {
               "n": "tag",
               "s": 2
            },
            {
               "n": "Freq",
               "s": "freq"
            }
         ],
         "Items": [
            {
               "Word": [
                  {
                     "n": "test"
                  },
                  {
                     "n": "test"
                  },
                  {
                     "n": "NN1"
                  }
               ],
               "fbar": 301,
               "freq": 8609,
               "norel": 1
            },

            ...
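
The frequency blocks can be flattened into simple rows, as in this sketch (assuming Python 3; the function name freq_rows is hypothetical). The sample reuses the first item of the example above and leaves out fields that the structure description does not cover.

```python
def freq_rows(freqs):
    """Flatten every block into (column values..., freq) tuples."""
    rows = []
    for block in freqs.get("Blocks", []):
        for item in block.get("Items", []):
            values = tuple(w["n"] for w in item["Word"])
            rows.append(values + (item["freq"],))
    return rows


# Abbreviated from the freqs example above.
sample = {
    "Blocks": [
        {
            "Head": [
                {"n": "word", "s": 0},
                {"n": "lemma", "s": 1},
                {"n": "tag", "s": 2},
                {"n": "Freq", "s": "freq"},
            ],
            "Items": [
                {"Word": [{"n": "test"}, {"n": "test"}, {"n": "NN1"}], "freq": 8609}
            ],
        }
    ]
}
print(freq_rows(sample))
```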

collx

Structure:

  • Head – list of table headings. Each heading contains:
    • n – name of the column. Can be empty.
    • s – column ID. Can be used as a value of the csortfn attribute. If n is empty, this is not included
  • Items – list of table lines. Each line contains:
    • Stats – list of the statistics in the line (in the same order as in the heading). Each statistic contains:
      • n – value itself (content of the column)
    • freq – collocation frequency
    • str – string expression of the collocate

Example (query and result):

https://api.sketchengine.co.uk/bonito/run.cgi/collx?q=q[lemma="test"];corpname=preloaded/bnc;csortfn=m

{
   "Head": [
      {
         "n": ""
      },
      {
         "n": "Freq",
         "s": "f"
      },
      {
         "n": "T-score",
         "s": "t"
      },
      {
         "n": "MI",
         "s": "m"
      }
   ],
   "Items": [
      {
         "Stats": [
            {
               "s": "2.828"
            },
            {
               "s": "12.938"
            }
         ],
         "freq": 8,
         "nfilter": "q=n-5 5 1 [word=\"Belvin\"]",
         "pfilter": "q=p-5 5 1 [word=\"Belvin\"]",
         "str": "Belvin"
      },

      ...

save* methods

These methods return the same output as their parent methods (see above); using them for JSON output is deprecated.

subcorp

Structure:

  • SubcorpList – list of available subcorpora. Each subcorpus contains:
    • n – name of the subcorpus

Fields available only if new subcorpus is created:

  • corpsize – size of the mother corpus (number of tokens)
  • subcsize – size of the created subcorpus (number of tokens)

Example (query and result):

https://api.sketchengine.co.uk/bonito/run.cgi/subcorp?corpname=preloaded/bnc;format=json

{
   "SubcorpList": [
      {
         "n": "book"
      },
      {
         "n": "wri-to-be-spoken"
      }
   ]
}

Generating the API key

The Sketch Engine API key

If you do not have a Sketch Engine account, create either the 30-day trial subscription or a paid subscription.

To generate your API key:

  • log in to Sketch Engine
  • when logged in, click the three-dot icon at the top-right corner of the screen and select My account
  • click the Generate new API key button
    (the API key is a long string of letters and numbers)
  • copy the API key and use it as described in the API Authentication documentation

Corpus creation

This documentation is in the form of Python examples. If you stitch the code snippets from this page together and replace placeholders in them, it should work like a charm. However, this API is still a work in progress. Things will break without warning. You’ve been warned.


If you have your own files, you can create a new corpus using our API within just a few steps:

  1. authenticate yourself,
  2. create a new corpus for a given language,
  3. upload files and then
  4. wait for processing.

After these steps, you will be able to access your corpus with API as usual (see what you can do). Of course, the variety of available queries will depend on the language and the content (size) of the files. So let’s start. You will need a few Python modules and your API key which you can get here.

#!/usr/bin/python
import json
import requests
import time

auth = ('%username%', '%api_key%')
URL = 'https://app.sketchengine.co.uk/api'

Before creating a corpus, you need to know what language you will be using. Let’s stick with English for now.

r = requests.post(URL + '/corpora', auth=auth, json={
    'language_id': 'en',
    'name': 'api_test'
})

You need only two parameters: the language of the corpus and its name. Use ISO 639-1 language codes. The API also provides a list of all languages supported by Sketch Engine.
We recommend using only ASCII (uppercase and lowercase Latin) characters in corpus names.

All responses are in JSON. You will need the corpus ID for future calls; this is how you get it:

corpus_id = r.json()['data']['id']

Now let’s upload some files. You need to provide their names, actual content and MIME type. Here’s an example.

files = {'file': ('testing.txt', open('/path/to/your/file/testing.txt', 'rb'), 'text/plain')}
r = requests.post(URL + '/corpora/' + str(corpus_id) + '/documents', auth=auth, files=files, params={'feeling': 'lucky'})

When you send files to the corpus, they are automatically processed which takes some time. You need to wait until the processing is done before starting corpus compilation. Check the compilation status of the corpus periodically:

r = requests.get(URL + '/corpora/' + str(corpus_id) + '/compilation', auth=auth)
status = r.json()['data']['status']
while status != 'READY':
    r = requests.get(URL + '/corpora/' + str(corpus_id) + '/compilation', auth=auth)
    status = r.json()['data']['status']
    time.sleep(5)

Once the files are converted and tagged, the status of the corpus will be READY. That is the time to run the compilation so that you can query the corpus later. The compilation also takes some time, so you need to wait again.

r = requests.post(URL + '/corpora/' + str(corpus_id) + '/compilation', json={}, auth=auth)
status = r.json()['data']['status']
while status != 'COMPILED':
    r = requests.get(URL + '/corpora/' + str(corpus_id) + '/compilation', auth=auth)
    status = r.json()['data']['status']
    time.sleep(5)

Here you go! The status now should be COMPILED and you are free to use the corpus. Use the corpname attribute as identifier for corpus querying.

If the status is READY after running a compilation, it means that the compilation probably failed.
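
Because a failed compilation leaves the status at READY, a loop that waits for COMPILED without a limit would never finish. The following sketch, assuming Python 3, adds a timeout to the polling. The function name wait_for_status and its default values are assumptions for illustration; get_status is any callable that returns the current status string.

```python
import time


def wait_for_status(get_status, wanted, interval=5.0, timeout=600.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it returns wanted or timeout seconds pass."""
    deadline = clock() + timeout
    while True:
        status = get_status()
        if status == wanted:
            return status
        if clock() >= deadline:
            raise TimeoutError("status is %r, expected %r" % (status, wanted))
        sleep(interval)
```

With the variables from the snippets above, get_status could be a small function that performs the GET request on URL + '/corpora/' + str(corpus_id) + '/compilation' and returns r.json()['data']['status'].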

If you have any questions or need to report a problem, contact us at support@sketchengine.co.uk

Happy hacking!

Each of the steps is described in a separate section with examples in Java and Python.

API examples

This page provides links to various API scripts that show how the Sketch Engine can be accessed automatically.

Example – Corpus Architect authentication and JSON

This example shows how to access the Sketch Engine API through the Corpus Architect (at http://api.sketchengine.co.uk) and get frequencies for a list of CQL queries by means of JSON (compare with the HTTP-authenticated version below). When querying our servers (*.sketchengine.co.uk), please run only one query at a time and wait 10 seconds before you run another query.

(the java version needs a module for JSON parsing and a non-standard org.apache.commons.httpclient module)

Example in Java (use attached file not-yet-commons-ssl-0.3.11.jar)

Example in Python 2.7 and in Python 3

Examples – Basic http authentication and JSON

This example demonstrates how to get a list of frequencies from a list of CQL queries. It uses Basic HTTP authentication (common for local installations but not compatible with the api.sketchengine.co.uk server) and a JSON query (compare with the CA auth version above).

Examples to get a list frequencies from a list of CQL queries (Java and Python 2.7)

The following example presents an easy way to convert common structures (dictionaries in Python, Maps in Java) to JSON objects and to use the resulting JSON objects as a query to Sketch Engine. It is available for Java and Python.

Examples to convert usual structures to JSON (Java and Python 2.7)

This example shows how to connect to the Sketch Engine service on your server using Basic HTTP authentication (it will not work for sketchengine.co.uk, which uses CA auth), send a query (in this particular case a simple word list query without JSON) and parse the result as JSON. Note that many modules for JSON parsing are available; you do not have to use the one from the examples.

Examples to connect the Sketch Engine service (Java and Python 2.7)

Minimalistic example – no authentication, no JSON parsing

The examples mentioned in the Authentication subsection (above) show bare bones of getting results from the Sketch Engine without any authentication or parsing overhead.

Example frequencies for a list of CQL queries by JSON

This example shows how to access the Sketch Engine API through the Corpus Architect (at http://api.sketchengine.co.uk) and get frequencies for a list of CQL queries by means of JSON. It needs a module for JSON parsing and the non-standard org.apache.commons.httpclient module.

Note: String corp must be in the format ‘preloaded/’ or ‘/user//’, e.g. the BNC 2 corpus has the format ‘preloaded/bnc2’

import org.apache.commons.httpclient.*;
import org.apache.commons.httpclient.cookie.CookiePolicy;
import org.apache.commons.httpclient.methods.*;
import org.apache.commons.httpclient.protocol.*;
import org.apache.commons.httpclient.contrib.ssl.*;
import org.json.*;
import java.util.*;
import java.io.*;

/**
 * @author Vojtech Kovar, xkovar3(at)fi.muni.cz
 * @author Milos Jakubicek, xjakub(at)fi.muni.cz
 * @author Nosmo King, hatejava@mailinator.com
 */

class example3_ca {

    static final String root_url = "api.sketchengine.co.uk";
    static final String ske_username = "";
    static final String ske_password = "";
   
    public static void main(String[] args) {
       
        String corp = "bnc2";
        String method = "view";
        String base_url = "/bonito/run.cgi/";
        // specifying attributes
        Map attrs = new HashMap();
        attrs.put("corpname", corp);
        attrs.put("pagesize", "1");
        attrs.put("format", "json");
       
        // query list can be loaded from a file, ...
        int qlist_size = 4;
        String query_list[] = new String[qlist_size];
        query_list[0] = "[lemma=\"test\"]";
        query_list[1] = "[lemma=\"drug\"][lemma=\"test\"]";
        query_list[2] = "[lemma=\"blood\"][lemma=\"test\"]";
        query_list[3] = "[lemma=\"test\"][lemma=\"result\"]";

        // make HTTPS connection
        String cookie_policy = CookiePolicy.DEFAULT; //use CookiePolicy.BROWSER_COMPATIBILITY in case cookie handling does not work
        HttpClient client = new HttpClient();
        try {
          Protocol.registerProtocol("https", new Protocol("https", (ProtocolSocketFactory) new EasySSLProtocolSocketFactory(), 443));
          //client.getHostConfiguration().setHost(root_url, 80, "http");
          client.getHostConfiguration().setHost(root_url, 443, "https");
          client.getParams().setCookiePolicy(cookie_policy);
        } catch (java.security.GeneralSecurityException e){
          e.printStackTrace();
        } catch (IOException e){
          e.printStackTrace();
        }
        client.getParams().setCookiePolicy(cookie_policy);
       
        // retrieve session id
        GetMethod authget = new GetMethod("/login/");
        try {
            int code=client.executeMethod(authget);
        } catch (IOException ex) {
            System.err.println("Error: couldn't retrieve session ID from Sketch Engine server.");
            System.exit(1);
        }
        authget.releaseConnection();
       
        // login   
        PostMethod authpost = new PostMethod("/login/");
        NameValuePair submit   = new NameValuePair("submit", "ok");
        NameValuePair username = new NameValuePair("username", ske_username);
        NameValuePair password = new NameValuePair("password", ske_password);
        authpost.setRequestBody(new NameValuePair[] {submit, username, password});
        try {
            int code=client.executeMethod(authpost);
        } catch (IOException ex) {
            System.err.println("Error: couldn't login to Sketch Engine server.");
            System.exit(2);
        }
        authpost.releaseConnection();

        // retrieve data
        for (int i = 0; i < qlist_size; i++) {
            try {
                attrs.put("q", "q" + query_list[i]);
                JSONObject json_query = new JSONObject(attrs);
                String url_string = base_url + method + "?json=" + json_query.toString();
                GetMethod getJSON = new GetMethod(new URI(url_string, false).toString());
                client.executeMethod(getJSON);
                JSONObject json = new JSONObject(new BufferedReader(new InputStreamReader (getJSON.getResponseBodyAsStream())).readLine());
                System.out.println(query_list[i] + "\t" + json.get("concsize").toString());
                getJSON.releaseConnection();
            } catch (URIException ex) {
                System.err.println("Error: malformed URI in request.");
            } catch (JSONException ex) {
                System.err.println("Error: malformed JSON format.");
            } catch (IOException ex) {
                System.err.println("Error: couldn't retrieve JSON data from Sketch Engine server.");
            }
        }
    }
}

See the API examples for the command-line tool cURL. They query the BNC corpus for a wordlist. Along with the query, a blacklist can be sent to the server; the words it contains are filtered out of the results. The first example authenticates with a cookie file; the second uses an API key, which you can get in your personal settings.

API examples for curl

cat >bl.wl << EOF
and
the
of
in
on
at
a
an
to
that
is
EOF

curl -c cookie.txt \
    -i \
    -F "username=USERNAME" \
    -F "password=PASSWORD" \
    "https://app.sketchengine.co.uk/login/" > /dev/null

curl -b cookie.txt \
    -i \
    -X POST \
    -F "corpname=bnc2_tt21" \
    -F "wlsort=f" \
    -F "wlattr=word" \
    -F "format=json" \
    -H "Content-Type: multipart/form-data" \
    -F "wlblacklist=@bl.wl" \
    "https://app.sketchengine.co.uk/corpus/wordlist" > result.json

cat result.json

Download the curl API example as TXT.

#!/bin/bash

cat >bl.wl << EOF
and
the
of
in
on
at
a
an
to
that
is
EOF

curl -F "corpname=preloaded/bnc2_tt21" \
    -F "wlsort=f" \
    -F "wlattr=word" \
    -F "format=json" \
    -F "api_key=APIKEY" \
    -F "username=USERNAME" \
    -H "Content-Type: multipart/form-data" \
    -F "wlblacklist=@bl.wl" \
    "https://api.sketchengine.co.uk/corpus/wordlist" > result.json

cat result.json

Download the curl API key example as TXT.

Discrepancies between API and interface results

When you query a corpus in the web interface, you may notice that the result appears very quickly, even for quite large corpora. This is possible only because the query is processed asynchronously: you instantly see only part of the result while the rest is computed in the background. This is indicated by the growing number of hits on the result page. Once the query is fully processed, the counting stops. When you query a corpus via the API, you usually do not want Sketch Engine to behave like this, so you can disable it by adding the parameter async=0 to the URL. See the documentation of the view method.
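A minimal sketch in Python 3 of how such a URL could be assembled. The helper name build_view_url is hypothetical; the host, the path, the "q" prefix on the query and the parameter names other than async are taken from the Java example above, and the semicolon separator follows the local-installation examples. Whether your server accepts exactly this form is an assumption to verify.

```python
from urllib.parse import quote


def build_view_url(base_url, corpname, cql):
    """Build a URL for the view method with asynchronous processing disabled.

    base_url is assumed to end with "/", e.g. ".../bonito/run.cgi/".
    """
    params = {
        "corpname": corpname,
        "q": "q" + cql,    # the query is prefixed with "q", as in the Java example
        "format": "json",
        "async": "0",      # wait for the complete result instead of a partial one
    }
    # Percent-encode every name and value, then join with ";" as in the
    # wordlist examples for local installations.
    pairs = [quote(k, safe="") + "=" + quote(v, safe="") for k, v in params.items()]
    return base_url + "view?" + ";".join(pairs)
```

Calling build_view_url("https://api.sketchengine.co.uk/bonito/run.cgi/", "bnc2", '[lemma="test"]') returns a URL whose last parameter is async=0, so a value such as concsize is read only after the whole query has been processed.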

WARNING!

Exceeding the limit of API requests

Our API servers apply a FUP (fair usage policy), which is defined in our Service level agreement. If you exceed the quotas, our server responds with the error HTTP 429 Too Many Requests. A request for JSON output is ignored in that case: an HTML response is returned together with the HTTP error. Your API script should detect this situation and stop, since all further requests will end with HTTP 429. It is advisable to increase the interval between queries.
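One way a script might react is sketched below in Python 3. The names QuotaExceeded, fetch and run_queries are hypothetical and not part of Sketch Engine; the sketch only assumes that the server signals the exceeded quota with HTTP status 429, as described above.

```python
import urllib.error
import urllib.request


class QuotaExceeded(Exception):
    """Raised when the server answers HTTP 429 Too Many Requests."""


def fetch(url):
    """Return the response body as text; raise QuotaExceeded on HTTP 429."""
    try:
        with urllib.request.urlopen(url) as response:
            return response.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        if err.code == 429:
            # The body is HTML here, not JSON, so do not try to parse it.
            raise QuotaExceeded(url) from err
        raise


def run_queries(urls, fetch_one=fetch):
    """Fetch the URLs in order and stop at the first HTTP 429."""
    results = []
    for url in urls:
        try:
            results.append(fetch_one(url))
        except QuotaExceeded:
            # Every further request would fail the same way, so give up now.
            print("HTTP 429 received, stopping after", len(results), "requests")
            break
    return results
```

Passing the fetching function as a parameter keeps the stopping logic separate from the network call, so it can be tried out with a stand-in function before running against the server.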

How long to wait depends on how many requests you plan to make, but a simple rule might be:

  • if you want to make fewer than 50 requests, you don’t need to wait between them,
  • if you want to make up to 900 requests, use an interval of 4 seconds per query,
  • if you want to make more than 2000 requests, use an interval of ca. 44 seconds.

If the queries take some time, you may decrease the interval.
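The rule of thumb above could be encoded roughly as follows (Python 3). The function names interval_for and paced are hypothetical. The rule does not say what to do for 901 to 2000 requests; this sketch assumes the longer interval for that range, which is a conservative guess and not a documented value.

```python
import time


def interval_for(total_requests):
    """Return the waiting time in seconds between two queries."""
    if total_requests < 50:
        return 0
    if total_requests <= 900:
        return 4
    # ASSUMPTION: 901-2000 requests are not covered by the rule,
    # so the interval given for more than 2000 requests is used for them too.
    return 44


def paced(items, sleep=time.sleep):
    """Yield the items one by one, waiting between consecutive items."""
    items = list(items)
    delay = interval_for(len(items))
    for index, item in enumerate(items):
        if index and delay:
            sleep(delay)  # no wait before the first item
        yield item
```

A script would then loop with "for query in paced(queries):" and send one request per iteration. If the queries themselves take some time, the fixed delay can be reduced accordingly.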

If you have exceeded the limit of API requests (specified in the Service level agreement), you can use our testing account with the following details:

  • login: api_testing
  • api_key: YNSC0B9OXN57XB48T9HWUFFLPY4TZ6OE

Available methods and attributes

See a list of all methods and attributes available in Sketch Engine.