Who is this page for?

This page is useful to you if you want to create:
  • many subcorpora manually
  • many subcorpora automatically
  • subcorpora based on complex criteria
  • subcorpora which other users can also use

Such subcorpora must be built using a subcorpus definition file. Unlike a subcorpus built using the standard procedure, every subcorpus built using a definition file is also accessible to all users to whom you have granted access to your corpus via the Share corpus function.

Subcorpus definition file

The subcorpus definition file is a plain text file with a specific structure indicating the name of each subcorpus and the criteria used to build it. A corpus can have only one definition file, but the same file can be used with several corpora. One definition file can contain definitions of an unlimited number of subcorpora.

Syntax

Each subcorpus definition starts with ‘=’ followed by the subcorpus name. The lines that follow contain the criteria, given either as text types or as a CQL query. Lines starting with # are comments and will be ignored. The line starting with *FREQLISTATTRS contains a list of positional attributes for which the subcorpus frequencies will be pre-computed.

For text types, it is necessary to specify a structure name on the second line and the structure attribute with its value on the third line. For CQL, the second line is -CQL- and the third line is the full CQL query. The list of corpus structures and their attributes can be checked on the corpus info page.

This example will build a subcorpus called “Year 2012” from all documents whose metadata attribute pub_year (the publication date) has the value 2012. The attribute names are those that you used when you built your corpus; you can check them on the corpus info page.

=Year 2012
    doc
    pub_year="2012"

This definition will build a subcorpus called “Survey” from the documents whose filename attribute contains the string “survey”. The value is written as a regular expression, so .* stands for any text before or after it.

=Survey
    doc
    filename=".*survey.*"
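
When many similar subcorpora are needed, the definitions can be generated by a script rather than typed by hand. The following is a minimal Python sketch under stated assumptions: the function name year_definitions and its parameters are hypothetical, and the structure doc and attribute pub_year are taken from the example above and may differ in your corpus. It only produces text in the format shown above.

```python
def year_definitions(years, structure="doc", attribute="pub_year"):
    """Return definition-file text with one subcorpus block per year.

    The names `structure` and `attribute` are assumptions; use the
    structure and attribute names listed on your corpus info page.
    """
    blocks = []
    for year in years:
        # One block: '=' + name, then the structure, then attribute="value".
        blocks.append(
            f'=Year {year}\n'
            f'    {structure}\n'
            f'    {attribute}="{year}"\n'
        )
    # A blank line between blocks keeps the output readable.
    return "\n".join(blocks)


if __name__ == "__main__":
    print(year_definitions(range(2010, 2013)))
```

The printed text can then be pasted into the Subcorpus definition field described in the next section.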

Apply the definition

You can apply the definition to your corpus using the following procedure, which is recommended for most users.

Start on the Dashboard and then:

  • MANAGE CORPUS
  • Configure
  • Expert Settings
  • Subcorpus definition
  • Type or paste your definitions in Subcorpus definition
  • Save and Compile.

After compilation, the subcorpora will be available in the subcorpus selectors in the interface. They will also be listed in ‘Manage Corpus – Subcorpora’ and on the corpus information page. Anyone you share the corpus with will also have access to them.

An example of a subcorpus definition file

###############################################################################
# Subcorpus definition file
###############################################################################
#
# Subcorpora created using a definition file are available to all users 
# with access to the corpus. Subcorpora created in other ways are only
# available to the corpus owner.
#
# Subcorpus definition format
# ----------------------------
# *FREQLISTATTRS attr1 attr2
#
# =subcorpus_id
# structure
# sub-query
#
# =subcorpus_id
# -CQL-
# full-cql-query
#
# *FREQLISTATTRS specifies a list of attributes for which frequency
# lists should be precomputed.
#
# Sub-query is a part of a corpus query which can be used in
# a "within" clause. It can consist of an and/or combination
# of attribute-value pairs.
#
# Full-cql-query is any CQL query whose result (KWIC) is taken as subcorpus
# definition.
#
# All strings starting with # are comments and are ignored to the end of the line.
#
###############################################################################

*FREQLISTATTRS word  lemma   lempos

=EU domain .eu
   doc
   tld="eu"

=Genre %s
   doc
   *genre 1%

#all sentences with a question mark at the end
=Questions
   -CQL-
   <s/> containing [word="\?"]
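
Before pasting a long file like the one above, it can help to list the subcorpora it defines. The sketch below is a hypothetical helper, not the parser used by the corpus compiler: the name parse_definitions is an assumption, and it handles only the rules stated on this page (whole-line # comments, a *FREQLISTATTRS line, and blocks that start with =).

```python
def parse_definitions(text):
    """Split definition-file text into (freqlist_attrs, subcorpora).

    `subcorpora` maps each subcorpus name to its list of criteria lines.
    Simplification (assumption): only lines that begin with '#' are
    treated as comments; a '#' later in a line is left untouched.
    """
    attrs = []
    subcorpora = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        if line.startswith("*FREQLISTATTRS"):
            attrs = line.split()[1:]  # attribute names after the keyword
        elif line.startswith("="):
            current = line[1:].strip()  # a new subcorpus block begins
            subcorpora[current] = []
        elif current is not None:
            subcorpora[current].append(line)  # criteria of the current block
    return attrs, subcorpora


if __name__ == "__main__":
    sample = '*FREQLISTATTRS word lemma\n=EU domain .eu\n   doc\n   tld="eu"\n'
    attrs, subs = parse_definitions(sample)
    print(attrs)
    for name, criteria in subs.items():
        print(name, criteria)
```

A block with fewer than two criteria lines in the result is a likely sign of a missing structure name or a missing -CQL- marker.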