Each corpus can (but does not have to) be divided into smaller parts called subcorpora. Subcorpora can be used to divide the corpus by text type (fiction, newspaper), medium (spoken, written), time (e.g. by year) or any other criterion. Subcorpora can overlap: the same segment can appear in every subcorpus it belongs to.
All tools in Sketch Engine, with the exception of the thesaurus, can make use of subcorpora, either by searching only one subcorpus of the corpus or by providing statistics for the same phenomenon in different subcorpora, e.g. in written vs. spoken language or in fiction vs. newspaper.
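To make overlapping subcorpora and per-subcorpus statistics concrete, here is a minimal Python sketch. It is a toy model, not Sketch Engine code: the documents, the attribute names `genre` and `medium`, and the helpers `members` and `relative_frequency` are all invented for illustration.

```python
# Toy model only: documents and attribute names are invented for illustration.
documents = [
    {"id": 1, "genre": "fiction", "medium": "written",
     "tokens": ["the", "cat", "sat"]},
    {"id": 2, "genre": "newspaper", "medium": "written",
     "tokens": ["the", "vote", "passed", "today"]},
    {"id": 3, "genre": "fiction", "medium": "spoken",
     "tokens": ["well", "the", "the", "end"]},
]

# A subcorpus is modelled here as a rule that picks documents out of the corpus.
subcorpus_rules = {
    "fiction": lambda d: d["genre"] == "fiction",
    "written": lambda d: d["medium"] == "written",
}


def members(name):
    """Return the ids of the documents that belong to the named subcorpus."""
    return {d["id"] for d in documents if subcorpus_rules[name](d)}


def relative_frequency(word, name):
    """Occurrences of `word` per token within the named subcorpus."""
    ids = members(name)
    tokens = [t for d in documents if d["id"] in ids for t in d["tokens"]]
    return tokens.count(word) / len(tokens)


# Document 1 is both fiction and written, so it sits in both subcorpora.
overlap = members("fiction") & members("written")
```

In this toy data, `overlap` contains only document 1, and the word "the" is more frequent in the fiction subcorpus than in the written one. The corpus itself is never split: each subcorpus is just a different view of the same documents.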
How to create a subcorpus?
A corpus can be divided into subcorpora from text types or from a concordance. A third option, using a configuration file, allows detailed specifications and is intended for advanced users. This page explains the first two options. Subcorpora are only available to the user who created them. Expert users can use the configuration file to share subcorpora with all users.
OPTION 1 – subcorpus from text types
This procedure creates a subcorpus from text types. This option can only be used if the corpus is annotated for text types.
The subcorpus building screen can be reached in two ways:
From the dashboard
On the corpus dashboard, click MANAGE CORPUS, then SUBCORPORA, then CREATE SUBCORPUS
From the advanced tab of any tool
On the advanced tab of any tool (with the exception of the thesaurus), click the plus sign next to the subcorpus selector.
Type a name for your new subcorpus.
Use the text type selectors to choose the required text types.
Click CREATE SUBCORPUS.
Creating a subcorpus may take a few seconds while statistics for the subcorpus are calculated. Watch for a notification. Once the subcorpus has been created, it appears in the subcorpus selector on the advanced tabs of all tools that support subcorpora.
Tips for using the text type selectors
you can select as many text types from as many selectors as you wish
selecting all values in a selector is the same as selecting none
type a few letters to search the text types
the selector can be expanded full-screen for practicality
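The rule that selecting all values in a selector is the same as selecting none can be sketched in Python as follows. This is an illustration only, not how Sketch Engine is implemented. It assumes that values ticked within one selector are alternatives and that different selectors must all be satisfied; that combination logic is not stated on this page. The function `matches` and the attribute names are hypothetical.

```python
def matches(doc_attrs, selection, all_values):
    """Return True if a document satisfies the chosen text types.

    doc_attrs:  attribute -> value for one document
    selection:  attribute -> set of values ticked in that selector
    all_values: attribute -> set of every value the selector offers
    """
    for attr, chosen in selection.items():
        # An empty selector, or one with every value ticked, restricts nothing.
        if not chosen or chosen == all_values[attr]:
            continue
        # Assumption: values inside one selector are alternatives (OR),
        # while separate selectors must all be satisfied (AND).
        if doc_attrs.get(attr) not in chosen:
            return False
    return True


# Hypothetical selectors and one hypothetical document.
all_values = {
    "genre": {"fiction", "newspaper"},
    "medium": {"spoken", "written"},
}
doc = {"genre": "fiction", "medium": "spoken"}

none_ticked = matches(doc, {"genre": set()}, all_values)
all_ticked = matches(doc, {"genre": {"fiction", "newspaper"}}, all_values)
```

Here `none_ticked` and `all_ticked` are both `True`, because neither selection narrows the corpus, whereas a selection such as `{"medium": {"written"}}` would exclude this document.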
OPTION 2 – subcorpus from a concordance
A subcorpus can be created from concordance lines. The user generates a concordance and decides how much context surrounding the KWIC should be included in the subcorpus. The context can be the whole documents, the paragraphs or only the sentences containing the KWIC. Other structures can be selected too if the corpus contains them.
Open a corpus and generate a concordance.
Click the plus icon in the menu above the concordance.
Type a name for your new subcorpus.
Indicate how much context should be included in the subcorpus by selecting the structure surrounding the KWIC. The available structures differ from corpus to corpus but usually include:
doc – the whole document (produces big subcorpora)
p – the whole paragraph
s – the whole sentence (produces small subcorpora)
For a detailed description of the structures used in the corpus, see corpus information.
Click CREATE SUBCORPUS
It may take a few seconds for the subcorpus to be built while the statistics are calculated. Watch for a notification. Once the subcorpus has been built, it is available in the subcorpus selector on the advanced tabs of tools which support subcorpora.
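The effect of the chosen structure on subcorpus size can be illustrated with a small Python sketch. It is a conceptual model only and does not reflect Sketch Engine internals: the 12-token corpus, the span boundaries, the hit positions and the helper `enclosing_spans` are all hypothetical.

```python
from bisect import bisect_right


def enclosing_spans(kwic_positions, spans):
    """Return the spans (start, end) that contain at least one KWIC position.

    spans must be sorted, non-overlapping, half-open token ranges [start, end).
    """
    starts = [start for start, _ in spans]
    selected = set()
    for pos in kwic_positions:
        # Find the last span that starts at or before this position.
        i = bisect_right(starts, pos) - 1
        if i >= 0 and pos < spans[i][1]:
            selected.add(spans[i])
    return sorted(selected)


# Hypothetical 12-token corpus: one document, two paragraphs, four sentences.
structures = {
    "doc": [(0, 12)],
    "p": [(0, 7), (7, 12)],
    "s": [(0, 3), (3, 7), (7, 10), (10, 12)],
}
kwic = [4, 5]  # token positions of the concordance hits

# Number of tokens the subcorpus would cover for each structure choice.
sizes = {
    name: sum(end - start for start, end in enclosing_spans(kwic, spans))
    for name, spans in structures.items()
}
```

With these toy values, `sizes` is `{"doc": 12, "p": 7, "s": 4}`: the same two hits yield the largest subcorpus with doc and the smallest with s.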
Delete a subcorpus
Deleting a subcorpus deletes only the subcorpus definition; no data are removed from the corpus. Users can only delete the subcorpora they created. Subcorpora supplied with preloaded corpora cannot be deleted.