This page describes an advanced way of building subcorpora in Sketch Engine using a definition file. To learn about basic subcorpus building, please read Create a subcorpus.

The definition file is a text file which contains information about how one or more subcorpora should be created. Only one definition file is allowed per corpus, but the file can contain definitions of many subcorpora.

Basic syntax

The definition must start with ‘=’ followed by the name of the subcorpus. Nothing else should appear on that line.

The subcorpus definition follows on the next line. A subcorpus can be defined in two ways:

  • with text types
  • with CQL

Definition with text types

The second line should contain the name of a structure. The third line should contain the text types (metadata) attached to this structure. The definition below will create two subcorpora:

  • example1 is a subcorpus made up of documents whose publication year is 2012
  • example2 is a subcorpus of documents which were created from uploaded files whose filename starts with a capital K (the value is a regular expression)

The structures and text types attached to them can be checked on the corpus info page.

=example1
    doc
    pub_year="2012"

=example2
    doc
    filename="K.*"
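
To get a feel for which values a pattern such as "K.*" would select, the matching can be approximated in Python. This is only a sketch: it assumes that the pattern has to match the whole attribute value (which is why "K.*" reads as "starts with K"), the helper matches_text_type and the file names are invented for illustration, and the regular expression dialect used by Sketch Engine may differ from Python's in details.

```python
import re


def matches_text_type(pattern: str, value: str) -> bool:
    """Approximate a text type condition such as filename="K.*".

    Assumption: the pattern must match the whole value, not just a part of it.
    """
    return re.fullmatch(pattern, value) is not None


# Hypothetical file names, used only to illustrate the pattern.
filenames = ["Kafka.txt", "kafka.txt", "notes_K.txt", "K"]
selected = [name for name in filenames if matches_text_type("K.*", name)]
print(selected)
```

Under this assumption, only the names that begin with a capital K are selected; a K in the middle of the name is not enough.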

Definition with CQL

A definition with CQL must have ‘-CQL-’ on line 2. Line 3 should contain a CQL query. The subcorpus will contain what would appear as KWIC if the query were used in the concordance. No other context will be included unless the CQL query explicitly includes it. Therefore, the use of the containing operator may be needed. The definition below will create four subcorpora:

  • example4 – all sequences of 2 nouns will be included; the subcorpus will contain only those nouns and nothing else. This is probably not very useful.
  • example5 – all sentences containing a sequence of 2 nouns will be included in the subcorpus.
  • example6 – all documents containing a sequence of 2 nouns will be included.
  • example7 – all documents published in 2012 which contain a sequence of 2 nouns will be included in the subcorpus.

=example4
-CQL-
[tag="N.*"] [tag="N.*"]

=example5
-CQL-
<s/> containing [tag="N.*"] [tag="N.*"] 

=example6
-CQL-
<doc/> containing [tag="N.*"] [tag="N.*"] 

=example7
-CQL-
<doc pub_date="2012"/> containing [tag="N.*"] [tag="N.*"]
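
The difference between these definitions is only whether the query is wrapped in a structure with the containing operator. The Python sketch below makes that explicit by rendering a CQL definition with or without the wrapper. The function cql_definition is a hypothetical helper, not part of Sketch Engine; it only produces text in the format described above.

```python
from typing import Optional


def cql_definition(name: str, query: str, structure: Optional[str] = None) -> str:
    """Render one CQL-based subcorpus definition.

    If a structure is given, the whole structure containing a match is
    included; otherwise only the matched tokens (the KWIC) are.
    """
    if structure is not None:
        query = f"<{structure}/> containing {query}"
    return f"={name}\n-CQL-\n{query}\n"


two_nouns = '[tag="N.*"] [tag="N.*"]'
print(cql_definition("example4", two_nouns))            # matched nouns only
print(cql_definition("example5", two_nouns, "s"))       # whole sentences
print(cql_definition("example6", two_nouns, "doc"))     # whole documents
```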

###############################################################################
# Subcorpus definition file
###############################################################################
#
# Subcorpora created using a definition file are available to all users 
# with access to the corpus. Subcorpora created using other ways are only
# available to the corpus owner.
#
# Subcorpus definition format
# ----------------------------
# *FREQLISTATTRS attr1 attr2
#
# =subcorpus_id
#   structure
#   sub-query
#
# =subcorpus_id
#   -CQL-
#   full-cql-query
#
# FREQLISTATTRS specifies a list of attributes for which frequency
# lists should be precomputed.
#
# Sub-query is a part of a corpus query which can be used in
# "within <structure sub-query />" clause.  It can consist of an and/or
# combination of attribute-value pairs.
#
# Full-cql-query is any CQL query whose result (KWIC) is taken as subcorpus
# definition.
#
# All strings starting with # are comments and are ignored to the end of line.
#
###############################################################################

*FREQLISTATTRS word lemma lempos

=spoken
  bncdoc
  alltyp="Spoken context-governed" | alltyp="Spoken demographic"


=book60
  bncdoc
  alltim="1960-1974" & wrimed="Book"


=first1000
  -CQL-
  [#0-1000]


=same_as_book60
  -CQL-
  <bncdoc alltim="1960-1974" & wrimed="Book"/>
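
Before pasting a longer definition file into the interface, it can help to check that every subcorpus name is followed by exactly two lines. The following Python sketch does that for the format shown above. It is not the parser used by Sketch Engine: parse_definitions is a hypothetical helper, and as a simplification it only skips lines that start with #, because a query such as [#0-1000] contains a # that has to be kept.

```python
def parse_definitions(text):
    """Split a definition file into directives and subcorpus definitions."""
    directives = {}  # e.g. {"FREQLISTATTRS": ["word", "lemma"]}
    bodies = {}      # subcorpus name -> non-empty lines that follow it
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        # Simplification: only whole-line comments and blank lines are skipped.
        if not line or line.startswith("#"):
            continue
        if line.startswith("*"):
            key, *values = line[1:].split()
            directives[key] = values
        elif line.startswith("="):
            current = line[1:]
            bodies[current] = []
        elif current is None:
            raise ValueError(f"line outside any subcorpus definition: {line!r}")
        else:
            bodies[current].append(line)

    subcorpora = {}
    for name, lines in bodies.items():
        if len(lines) != 2:
            raise ValueError(
                f"{name}: expected 2 lines after the name, found {len(lines)}"
            )
        first, second = lines
        is_cql = first == "-CQL-"
        subcorpora[name] = {
            "kind": "cql" if is_cql else "text types",
            "structure": None if is_cql else first,
            "query": second,
        }
    return directives, subcorpora


SAMPLE = """
# a shortened version of the example file
*FREQLISTATTRS word lemma lempos

=spoken
  bncdoc
  alltyp="Spoken context-governed" | alltyp="Spoken demographic"

=first1000
  -CQL-
  [#0-1000]
"""

directives, subcorpora = parse_definitions(SAMPLE)
print(directives)
for name, definition in subcorpora.items():
    print(name, definition)
```

A definition with a missing or extra line raises a ValueError that names the subcorpus, which is usually easier to act on than a failed compilation.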

Applying the definition

There are two ways of applying the subcorpus definition to the corpus:

  • via the web interface – recommended for most users
  • with a script – only for system admins

Via the interface

Start on the Dashboard and follow these steps:

  • MANAGE CORPUS
  • Configure
  • Expert Settings
  • Subcorpus definition
  • Copy and paste the text of the definition into the input line. The line will expand as necessary.
  • Save and Compile.

When the compilation finishes, the subcorpora will appear in the subcorpus selector on the input forms, also in Manage corpus – Subcorpora and on the corpus info page. The subcorpora will also be available to all users you share the corpus with.

With the mksubc.py script

(for system admins only)

Usage: mksubc.py CORPNAME SUBCORP_DIR SUBCORP_DEF_FILE

SUBCORP_DIR is the directory where the subcorpora will be created; its location depends on the Sketch Engine installation. Global subcorpora (accessible to all users) should be stored in the directory set in the SUBCBASE attribute of the corpus configuration file, which defaults to PATH/subcorp/.

Note that mksubc.py is run by compilecorp (see Compiling Corpus).
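
When the script is called from an admin's own tooling, the argument order from the usage line can be kept in one place. The Python sketch below only builds and prints the command; the corpus name, directory and file name are placeholders, and mksubc_command is a hypothetical helper rather than anything shipped with Sketch Engine.

```python
import shlex


def mksubc_command(corpname, subcorp_dir, def_file):
    """Build the argument list in the order given by the usage line."""
    return ["mksubc.py", corpname, subcorp_dir, def_file]


# All three values are placeholders, not real installation paths.
cmd = mksubc_command("mycorpus", "/path/to/subcorp", "subcorpora.def")
print(shlex.join(cmd))
# To run it for real: import subprocess; subprocess.run(cmd, check=True)
```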

When is this useful?

Building a subcorpus using a definition file is useful in these situations:

subcorpus sharing
Subcorpora are normally only available to the owner of the corpus. Subcorpora built via the definition file will be available to the users you share the corpus with.

lots of subcorpora (with minimal variation)
Although you can build any number of subcorpora from the concordance or from text types, the process of clicking in the interface can be tedious, especially if there are minimal differences between the subcorpora.

mobility
If you have two or more corpora and you want their subcorpora to be built by exactly the same criteria, the subcorpus definition file makes it possible to simply copy the definition from one corpus to another, ensuring the subcorpora are based on identical criteria.

subcorpus tuning
You want to be able to improve an existing subcorpus by repeatedly adapting its definition.
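
For the case of many subcorpora with minimal variation, the definition text itself can be generated instead of typed. The Python sketch below writes one text type definition per publication year. It assumes a structure called doc with a pub_year attribute, as in example1; the function name and the year prefix are invented for illustration, and the structure and attribute names have to match those listed on the corpus info page.

```python
def yearly_definitions(structure, attribute, years, prefix="year"):
    """Return definition-file text with one subcorpus per year."""
    blocks = []
    for year in years:
        blocks.append(
            f"={prefix}{year}\n"
            f"    {structure}\n"
            f'    {attribute}="{year}"\n'
        )
    # A blank line between definitions keeps the file readable.
    return "\n".join(blocks)


text = yearly_definitions("doc", "pub_year", range(2010, 2013))
print(text)
```

The printed text can then be pasted into the Subcorpus definition field, or copied to another corpus that uses the same structure and attribute names.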

Other ways of building subcorpora