What is a subcorpus?

Each corpus can be divided into smaller parts called subcorpora. Subcorpora can be used to divide the corpus by the type of text(fiction, newspaper), media (spoken, written), time (e.g. publication year) or by any other criteria. Subcorpora can be overlapping, the same segment can appear in all subcorpora whose criteria it meets.

All tools, except the thesaurus, can be set to analyse a subcorpus instead of the whole corpus.

Lock a subcorpus

Normally, a subcorpus has to be selected each time the user needs to search it. When working with the same subcorpus for an extended period of time, the corpus can be locked. When locked, all tools will have this subcorpus preselected. The user does not need to select it each time. It will stay locked until the user unlocks it gain.

How to create a subcorpus?

A corpus can be divided into subcorpora using text types or from a concordance. The third option using a configuration file is for advanced users and allows for detailed specifications. This page explains the first two options. Subcorpora are only available to the user who created them. Expert users can use the configuration file to share subcorpora with all users.

OPTION 1 – subcorpus from text types

This procedure will create a subcorpus from text types. This option can only be used if the corpus is annotated for text types.

The subcorpus building screen can be reached in two ways:

From the dashboard

On the corpus dashboard, click MANAGE CORPUS, then SUBCORPORA, then CREATE SUBCORPUS

From the advanced tab of any tool

On the advanced tab of any tool (with the exception of the thesaurus), click the plus sign add next to the subcorpus selector.

create a new subcorpus

  • type a name for your new subcorpus
  • use the text type selectors to choose the required text types

Creating a subcorpus may take a few seconds while statistics for the subcorpus are calculated. Watch for a notification. When finished, it will appear in the subcorpus selector on advanced tabs of all tools that support subcorpora.

Tips for using the text type selectors
  • you can select as many text types from as many selectors  as you wish
  • selecting all values in a selector is the same as selecting none
  • type a few letters to search the text types
  • the selector can be expanded full-screen for practicality

OPTION 2 – subcorpus from a concordance

A subcorpus can be created from concordance lines. The user generates a concordance and decides how much context surrounding the KWIC should be included in the subcorpus. This context is defined by documents, the paragraphs or only the sentences containing the KWIC. Other structures can be selected too if the corpus contains them.

  • Open a corpus and generate a concordance.
    Click the plus icon add in the menu above the concordance.
  • Type a name for your new subcorpus.
  • Indicate how much context should be included in the subcorpus by selecting the structure surrounding the KWIC. The available structures can differ betewen corpora. These are the most common ones:
    doc – the whole document (produces big subcorpora)
    p – the whole paragraph
    s – the whole sentence (produces small subcorpora)
    For a detailed description of the structures used in the corpus, see corpus information.

It may take a few seconds for the corpus to be built while the statistics are calculated. Watch for a notification. When finished, the subcorpus is available in the subcorpus selector on the advanced tabs of tools which support subcorpora.

Delete a subcorpus

When deleting a subcorpus, only the subcorpus definition is deleted, no data are removed from the corpus. Users can only delete the subcorpora they created. Subcorpora supplied with preloaded corpora cannot be deleted.

To delete a subcorpus

  • go to the corpus dasboard
  • click  SUBCORPORA
  • delete the subcorpus