What is a virtual corpus?

A virtual corpus is a corpus that is compiled from other corpora or corpus parts (i.e. subcorpora). In other words, the sources of the virtual corpus are verticals of the respective corpora or their parts.

The virtual corpus functionality is available from Manatee version 2.88.

Why use virtual corpora?

If you need to combine several corpora or subcorpora, you can build a virtual corpus. It is easier and faster to set up a virtual corpus on top of existing corpora than to create a new corpus from their source vertical files. The resulting virtual corpus takes only a fraction of the disk space that its non-virtual counterpart would.

How to set up a virtual corpus?

1. Create a virtual corpus definition file

A virtual corpus definition file is a plain text file specifying which corpora will be used to create the virtual corpus. It consists of a list of parts, each part having the following format:

=<CORPUS_NAME>
<from_position>,<end_position>
<from_position>,<end_position>


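This says that the part of the corpus <CORPUS_NAME> starting at <from_position> (inclusive) and ending at <end_position> (exclusive) should be included. A part may list several such ranges, one per line. The dollar sign (‘$’) can be used instead of <end_position> to denote the end of the corpus.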

Example:

=bnc
1000,2500
3500,4500

=susanne
0,$

A virtual corpus using this definition file would consist of two parts of the “bnc” corpus (positions 1000–2499 and 3500–4499, since the end position is exclusive) and the whole “susanne” corpus.
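
To make the range semantics concrete, here is a minimal Python sketch that reads a definition in the format above and counts the tokens it selects. It is an independent illustration, not Manatee's own parser: the function names parse_definition and total_tokens are hypothetical, and the size given for “susanne” is a made-up placeholder that is only needed to resolve ‘$’.

```python
def parse_definition(text):
    """Parse a virtual corpus definition into [(corpus, [(start, end), ...]), ...].

    end is None where the definition uses '$' (end of corpus).
    """
    parts = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank lines only separate parts
        if line.startswith("="):
            parts.append((line[1:], []))  # new part: =<CORPUS_NAME>
        else:
            if not parts:
                raise ValueError("range given before any =<CORPUS_NAME> line")
            start, end = line.split(",")
            end = end.strip()
            parts[-1][1].append((int(start), None if end == "$" else int(end)))
    return parts


def total_tokens(parts, corpus_sizes):
    """Count selected tokens; each range is half-open: [start, end)."""
    total = 0
    for corpus, ranges in parts:
        for start, end in ranges:
            if end is None:
                end = corpus_sizes[corpus]  # '$' resolves to the corpus size
            total += end - start
    return total


definition = """\
=bnc
1000,2500
3500,4500

=susanne
0,$
"""

parts = parse_definition(definition)
# ASSUMPTION: 150000 is a placeholder, not the real size of the susanne corpus.
size = total_tokens(parts, {"susanne": 150000})
print(parts)
print(size)
```

Because the end position is exclusive, the two “bnc” ranges contribute 1500 and 1000 tokens respectively, not 1501 and 1001.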

2. Create a virtual corpus configuration file

This is a regular corpus configuration file, as used for non-virtual corpora, with two specifics:

  • You can start by adapting the configuration file of one of the corpora that the virtual corpus consists of. However, you have to make sure that all attributes and structures specified in the virtual corpus configuration file are present in ALL of its parts.
  • Instead of specifying the input vertical source file with the VERTICAL attribute, use the VIRTUAL attribute, which should contain the full path to the virtual corpus definition file created in step 1.

Example:

NAME "Susanne + Bnc"
PATH /corpora/manatee/virtual_english
VIRTUAL /corpora/virtdef/virtual_english # this is the virtual corpus definition file

ATTRIBUTE word
ATTRIBUTE tag
ATTRIBUTE lemma
...
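
The requirement that every attribute of the virtual corpus exists in all source corpora amounts to a simple set check. The following Python sketch shows the idea under stated assumptions: the helper missing_attributes is hypothetical, and the attribute lists are invented for illustration (in practice they would come from the ATTRIBUTE lines of each source corpus configuration file). The same check applies to structures.

```python
def missing_attributes(virtual_attrs, source_attrs):
    """Return {corpus: sorted attributes that the corpus lacks}.

    An empty result means every attribute of the virtual corpus
    is present in all source corpora.
    """
    missing = {}
    for corpus, attrs in source_attrs.items():
        lacking = sorted(set(virtual_attrs) - set(attrs))
        if lacking:
            missing[corpus] = lacking
    return missing


# Attributes declared in the virtual corpus configuration file.
virtual_attrs = ["word", "tag", "lemma"]

# ASSUMPTION: invented attribute lists, not the real corpus configurations.
source_attrs = {
    "bnc": ["word", "tag", "lemma"],
    "susanne": ["word", "tag"],  # lemma deliberately left out for the demo
}

problems = missing_attributes(virtual_attrs, source_attrs)
print(problems)
```

With these invented inputs the check reports that “susanne” lacks lemma, which means the lemma attribute would have to be dropped from the virtual corpus configuration or added to that source corpus.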

3. Compile the virtual corpus

You can use the compilecorp wrapper script as usual. It will detect the VIRTUAL attribute and automatically use the mkvirt utility instead of encodevert to compile the corpus.