In addition to the basic procedure (which also produces corpora mapped 1:1), parallel corpora can be created from other sources, including vertical files. Sketch Engine supports both 1:1 and m:n mapping. Each language of a parallel corpus can be searched individually as a monolingual corpus or together with one or more aligned corpora (languages).
1:1 mapping is a type of alignment where all aligned corpora have exactly the same number of aligned structures, typically the same number of sentences or paragraphs, i.e. each sentence in one corpus has a matching sentence in the other corpus.
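The 1:1 condition can be checked before uploading. The following is a minimal sketch, not part of Sketch Engine: it assumes each alignment structure opens with a tag on its own line (for example `<align>`), and the function names `count_structures` and `is_one_to_one` are hypothetical.

```python
import re


def count_structures(vertical_text, struct="align"):
    """Count opening tags of `struct` in vertical-format text.

    Assumption: each opening tag sits on its own line and may carry
    attributes, e.g. <align> or <align id="1">.
    """
    pattern = re.compile(
        rf"^<{re.escape(struct)}(?:\s[^>]*)?>[ \t]*$", re.MULTILINE
    )
    return sum(1 for _ in pattern.finditer(vertical_text))


def is_one_to_one(texts, struct="align"):
    """Return (ok, counts); ok is True when every text has the same count."""
    counts = [count_structures(text, struct) for text in texts]
    return len(set(counts)) <= 1, counts


# Two tiny illustrative corpora with two segments each.
english = "<align>\nThe\ndog\nbarks\n.\n</align>\n<align>\nIt\nis\nloud\n.\n</align>\n"
german = "<align>\nDer\nHund\nbellt\n.\n</align>\n<align>\nEr\nist\nlaut\n.\n</align>\n"

ok, counts = is_one_to_one([english, german])
```

A mismatch in the counts means at least one segment is missing or split somewhere, so every segment after that point would be paired with the wrong counterpart.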
An alignment structure must be present in the corpus. By default, the corpora will be aligned by the <align> structure. A different alignment structure already present in the corpus (e.g. sentence or paragraph) can be set with the ALIGNSTRUCT corpus attribute.
Here is an example of two source vertical files suitable for processing into parallel corpora. Each contains two sentences.
Continuous running text can also be uploaded, provided the structures are present.
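To illustrate what such a source file can look like, here is a minimal sketch that writes already tokenized segments in vertical format: one token per line, with each segment wrapped in the alignment structure. The helper name `to_vertical` and the sample sentences are assumptions made for illustration only.

```python
def to_vertical(segments, struct="align"):
    """Render tokenized segments as vertical text.

    `segments` is a list of token lists. Each segment is wrapped in an
    opening and closing `struct` tag, and each token gets its own line.
    """
    lines = []
    for tokens in segments:
        lines.append(f"<{struct}>")
        lines.extend(tokens)
        lines.append(f"</{struct}>")
    return "\n".join(lines) + "\n"


# Two sentences, so the output contains two <align> structures.
english_vertical = to_vertical([
    ["The", "dog", "barks", "."],
    ["It", "sleeps", "."],
])
print(english_vertical)
```

The file for the other language would be produced the same way, with the same number of segments in the same order.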
Using the web interface to create a parallel corpus.
- log in to Sketch Engine
- create two (or more) corpora and make sure all of them contain the same alignment structure, e.g. <align>
- set the alignment: select the corpus, click Manage corpus, then Configure corpus in the sidebar, tick all corpora which should be aligned and save
- repeat the previous step for all aligned corpora in the set
If the alignment structure is not <align>, edit the corpus configuration like this:
- select the corpus, click Manage corpus, then turn on the expert mode
- add a line that sets the ALIGNSTRUCT attribute (use the actual structure name) to the corpus configuration file and save the form
Parallel corpora in English, German and Spanish will be uploaded. The corpora will be aligned using the align structure in the source data.
1. Create three corpora, one in each language.
2. If each corpus consists of multiple files, make sure the alphabetical order of the corresponding files is the same in all corpora, i.e. the first English file must correspond to the first German file and the first Spanish file, the second English file to the second German and Spanish files, etc. It may be practical to prefix the file names with a number to avoid aligning incorrect segments.
3. Make sure the source data contain the align structure to mark segments. No segment can be omitted. The order of segments must be the same in all aligned corpora. The structure must be added to the files before uploading them.
You can also use alignment software such as hunalign. Manual correction of the output might be necessary.
English – 01_dog.txt
German – 01_Hund.txt
4. Upload the source files into the corpora.
5. The corresponding align segments in data from all corpora will be automatically connected: the first together, the second together, etc.
6. Set the alignment – align each corpus to all other corpora in the set. (Manage corpus - Configure corpus)
7. Recompile all three corpora.
8. Open any of the corpora. The search form will offer the aligned corpora. Select one or more.
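The file-order requirement in step 2 can be checked locally before uploading. The sketch below assumes one directory per language and pairs files by their sorted names. Python sorts by code point, which may differ from the ordering Sketch Engine applies, so numeric prefixes such as `01_` remain the safest convention. The function name `pair_files` and the directory names are hypothetical.

```python
from pathlib import Path


def pair_files(dirs):
    """Pair files across language directories by sorted file name.

    `dirs` maps a language label to a directory path. Returns a list of
    dicts, each mapping language label to the file name at that position.
    Raises ValueError when the directories hold different numbers of files.
    """
    listings = {
        lang: sorted(p.name for p in Path(d).iterdir() if p.is_file())
        for lang, d in dirs.items()
    }
    sizes = {lang: len(names) for lang, names in listings.items()}
    if len(set(sizes.values())) > 1:
        raise ValueError(f"file counts differ: {sizes}")
    # zip walks all listings in parallel: first with first, second with second.
    return [dict(zip(listings, group)) for group in zip(*listings.values())]


# Usage (directory names are placeholders):
# for row in pair_files({"en": "data/en", "de": "data/de", "es": "data/es"}):
#     print(row)
```

Reading the printed rows makes it easy to spot a file that would be matched with the wrong counterpart.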
Download: helper script for parallel corpora
Defining aligned corpora via the configuration file
Apart from the user interface, aligned corpora can also be defined via the configuration file. Two new lines must be added to the corpus configuration file of each of the aligned corpora.
Line 1 is the declaration of the align structure:
Since Manatee 2.67, an existing structure can be set as the alignment structure using the ALIGNSTRUCT attribute:
Line 2 is the list of IDs of all corpora that are aligned with the corpus:
With this setting, Sketch Engine will identify the aligned corpora.
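As a rough illustration of the two lines described above, the sketch below assembles them for one corpus. Only the ALIGNSTRUCT attribute name comes from this page. The `STRUCTURE align` declaration, the `ALIGNED` attribute name, the quoting and the comma separator are assumptions about the configuration syntax and should be checked against the corpus configuration reference. The corpus IDs and the function name `alignment_config` are placeholders.

```python
def alignment_config(aligned_ids, alignstruct=None):
    """Build the two alignment lines for one corpus configuration file.

    aligned_ids: IDs of the other corpora aligned with this corpus.
    alignstruct: name of an existing structure to align by; when None,
                 a dedicated align structure is declared instead.
    NOTE: attribute spellings and value syntax are assumptions.
    """
    if alignstruct is None:
        first = "STRUCTURE align"               # assumed declaration syntax
    else:
        first = f'ALIGNSTRUCT "{alignstruct}"'  # attribute named on this page
    joined = ",".join(aligned_ids)
    second = f'ALIGNED "{joined}"'              # assumed attribute name
    return first + "\n" + second


# Lines for an English corpus aligned with German and Spanish corpora.
print(alignment_config(["corpus_de", "corpus_es"]))
```

The same two lines would be generated for each corpus in the set, each time listing the IDs of the other corpora.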