Build parallel and multilingual corpora

Parallel corpora can be built from:

non-aligned texts in common document formats
aligned texts in a tabular format, e.g. .xlsx or .tmx
a vertical file with 1:1 mapping (for expert users)
a vertical file with M:N mapping (for expert users)

Use the corresponding tab below for more information.

from non-aligned documents

Parallel corpus from non-aligned documents

Parallel corpora can be built from non-aligned texts in common document formats, e.g. from two PDFs where one is the translation of the other. The supported formats are: .doc, .docx, .htm, .html, .pdf, .txt

This method only supports 2 languages. If your parallel corpus has more languages, an external tool or a manual procedure should be used for the alignment.

Automatic alignment

After uploading, the documents will be converted into plain text and aligned automatically at the sentence level and processed into a parallel corpus. The whole process does not require intervention by the user.

For best results

The documents must be translations of one another. (Not random texts about a similar topic.)
Documents containing only text in a single column produce the best results.
Documents with a complex design, such as advertisements, promotional leaflets or posters, may be impossible to align or may produce poor results.

How to build a parallel corpus from documents

go to DASHBOARD
click NEW CORPUS
type a name and click Multilingual corpus, click NEXT
click NON-ALIGNED DOCUMENTS
select the languages and type the names of the corpora; for practicality, use the same name for both languages
upload the documents; multiple documents are supported, but they must be uploaded in the same order in both languages
click NEXT and wait for the corpus to be processed and compiled

Correct alignment errors

Alignment errors cannot be corrected in Sketch Engine. If there are too many, they have to be corrected outside Sketch Engine. Download the corpus in one of the available formats, e.g. XLSX, and use Excel or Google Sheets to correct the alignment.

Analyse the corpus

Learn to work with the parallel corpus on our YouTube channel or in this guide.

Add more data

The above procedure cannot be used to make an existing corpus bigger because it does not allow adding new data. Instead, build a new corpus, download it and add it to the first corpus using the same procedure as the one used for aligned data.

Refine the corpus

Read our documentation on fine-tuning corpora to improve the use of your corpus.

from aligned texts

Parallel corpus from tabular data

The simplest way to build a parallel corpus is to upload data in one of these formats: Excel spreadsheet (XLS, XLSX), or a translation memory format (TMX or XLIFF).

Data format

Your data must be in the format described on Data format for a parallel corpus.

Upload your data

If your parallel data are in the required format, follow these steps to build your corpus:

on the corpus dashboard, click NEW CORPUS

click MULTILINGUAL, then ALIGNED DOCUMENTS
type the corpus name and select the file
- other supported formats: XLIFF (v. 2.0 and higher), TSV, XLSX
  (if XLSX does not upload correctly, try opening the file in Excel and saving it as Excel 97-2003 Workbook)
on the next screen, check that the languages are correct
click NEXT
wait for the corpus to be processed; you can leave the screen and let the process finish in the background. You will find your corpus in My corpora.

Each language in the source file will be processed into a separate monolingual corpus and aligned with the corresponding corpus in the other language(s).

How to search

Refer to the Parallel concordance lesson to learn to search and analyse parallel corpora.

Add more texts - make the corpus bigger

Additional data must be added separately to each language. The file containing aligned data has to be uploaded as many times as there are languages in the corpus. It cannot be done in one step. Each time, Sketch Engine will extract only one language corresponding to the language of the selected corpus.

select the first language of the multilingual corpus
go to DASHBOARD
click MANAGE CORPUS, then MAKE BIGGER
click I have my own texts
upload the file and check that the processing finished successfully
if it did not, Sketch Engine could not guess the appropriate language code. In that case, click the file name, then the gear icon, select the language code manually and submit the form. The file should now be processed correctly.
proceed to the next screens to compile the corpus
when the corpus is compiled, select the second language of the multilingual corpus and follow the same procedure to upload the data in the second language

This procedure has to be repeated as many times as there are languages in the corpus.

vertical file with 1:1 mapping

1:1 mapping expert users

In addition to the basic method (which also produces corpora mapped 1:1), parallel corpora can also be built from other sources including vertical files. Sketch Engine supports both 1:1 and m:n mapping. Each language of a parallel corpus can be searched individually as a monolingual corpus or as aligned to one or more corpora (languages).

1:1 mapping

1:1 mapping is a type of alignment where all aligned corpora have exactly the same number of aligned structures, typically sentences or paragraphs, i.e. each sentence in one corpus has a matching sentence in the other corpus.

Data preparation

It is a requirement that an alignment structure is present in the corpus. By default, the corpora will be aligned by the align structure. A different alignment structure already present in the corpus (e.g. sentence or paragraph) can be set with the ALIGNSTRUCT corpus attribute.

Here is an example of two source vertical files suitable for processing into parallel corpora. Each contains two sentences.

Corpus 1

Corpus 2

A continuous flowing text can also be uploaded provided the structures are present.

Corpus 1

Corpus 2
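As a rough illustration of the layout described above, the following Python sketch writes segments in vertical format, with one token per line and each segment wrapped in the default align structure. The function name to_vertical, the naive whitespace tokenisation and the sample sentences are assumptions made for this illustration only; real data would normally be tokenised with a proper tool.

```python
def to_vertical(segments, struct="align"):
    """Return vertical text: one token per line, each segment wrapped in <struct>."""
    lines = []
    for segment in segments:
        lines.append(f"<{struct}>")
        # Naive whitespace tokenisation (an assumption for this sketch).
        lines.extend(segment.split())
        lines.append(f"</{struct}>")
    return "\n".join(lines) + "\n"


# Made-up sample segments; both corpora must have the same number of segments.
english = to_vertical(["The dog barks .", "It is loud ."])
german = to_vertical(["Der Hund bellt .", "Er ist laut ."])

print(english)
print(german)
```

Both outputs contain two align structures in the same order, which is what 1:1 mapping requires.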

Using the web interface to create a parallel corpus.

log in to Sketch Engine
create two (or more) corpora and make sure all of them contain the same alignment structure, e.g. align
set the alignment: select the corpus and click Manage corpus
  1. change the corpus configuration via the Configure tab and confirm "I am an expert"
  2. add the following two lines at the end of your corpus configuration
     ALIGNSTRUCT "structure"
     ALIGNED "name_of_aligned_corpus_1,name_of_aligned_corpus_2"
  3. click save
repeat the previous step (setting the alignment) for all aligned corpora in the set

Example

Parallel corpora in English, German and Spanish will be uploaded. The corpora will be aligned using the align structure in the source data.

1. Create three corpora, one in each language.

2. If each corpus consists of multiple files, make sure the alphabetical order of the corresponding files is the same in all corpora, i.e. the first English file must correspond to the first German file and the first Spanish file, the second file to the second files, etc. It may be practical to prefix the file names with a number to avoid aligning incorrect segments.

files	English	German	Spanish
first	01_dog.txt	01_Hund.txt	01_perro.txt
second	02_care.txt	02_Pflege.txt	02_cuidado.txt
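The ordering requirement can be checked before uploading. The sketch below sorts the file names of each language and pairs them position by position, so the pairing can be inspected by eye. The helper pair_files is hypothetical, and Python's default string sorting is only assumed to match the alphabetical order used during upload; numeric prefixes, as in the table above, make a mismatch unlikely.

```python
def pair_files(*file_lists):
    """Sort each language's file names and zip them into aligned tuples."""
    sorted_lists = [sorted(names) for names in file_lists]
    if len({len(names) for names in sorted_lists}) != 1:
        raise ValueError("every language needs the same number of files")
    return list(zip(*sorted_lists))


# File names taken from the table above, deliberately listed out of order.
english = ["02_care.txt", "01_dog.txt"]
german = ["01_Hund.txt", "02_Pflege.txt"]
spanish = ["02_cuidado.txt", "01_perro.txt"]

for row in pair_files(english, german, spanish):
    print(row)
```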

3. Make sure the source data contain the align structure to mark segments. No segment can be omitted. The order of segments must be the same in all aligned corpora. The structure must be added to the files before uploading them.

You can also use an alignment software such as hunalign. Manual correction of the output might be necessary.

English – 01_dog.txt

German – 01_Hund.txt

Spanish – 01_perro.txt
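Because no segment may be omitted, it can help to count the align structures in each file before uploading. The sketch below is one possible check, assuming that each opening tag sits on its own line as in vertical format. The function names and the sample texts are made up for this illustration.

```python
import re


def count_structures(vertical_text, struct="align"):
    """Count opening tags of the given structure, each on its own line."""
    pattern = re.compile(
        rf"^<{re.escape(struct)}(?:\s[^>]*)?>[ \t]*$", re.MULTILINE
    )
    return sum(1 for _ in pattern.finditer(vertical_text))


def check_same_counts(texts_by_language, struct="align"):
    """Return the per-language counts and whether they all agree."""
    counts = {
        lang: count_structures(text, struct)
        for lang, text in texts_by_language.items()
    }
    return counts, len(set(counts.values())) == 1


# Made-up sample data; the Spanish text is missing one segment on purpose.
sample = {
    "English": "<align>\nThe\ndog\n</align>\n<align>\nIt\nbarks\n</align>\n",
    "German": "<align>\nDer\nHund\n</align>\n<align>\nEr\nbellt\n</align>\n",
    "Spanish": "<align>\nEl\nperro\n</align>\n",
}

counts, ok = check_same_counts(sample)
print(counts, "OK" if ok else "segment counts differ")
```

Equal counts do not prove that the segments correspond to each other, so a manual spot check is still advisable.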

4. Upload the source files into the corpora via the option "I have my own texts".

5. Open the corpus configuration of your corpus (Select corpus > Manage corpus > Configure).

6. Set the alignment – add the two lines starting with ALIGNSTRUCT and ALIGNED at the end of your corpus configuration

7. Click save.

8. Recompile all three corpora.

9. Open any of the corpora; the Parallel concordance tab will be available on the dashboard of your corpus.

Attachment

Download: helper script for parallel corpora

Defining aligned corpora via the configuration file

Apart from the user interface, aligned corpora can also be defined via the configuration file. Two new lines must be added to the corpus configuration file of each of the aligned corpora.

Line 1 is the declaration of the align structure:

STRUCTURE align

Since Manatee 2.67, an existing structure can be set as the alignment structure using the ALIGNSTRUCT attribute:

ALIGNSTRUCT "s"

Line 2 is the list of IDs of all corpora that are aligned with the corpus:

ALIGNED "aligned_corpus_id_1,aligned_corpus_id_2"

With this setting, Sketch Engine will identify the aligned corpora.
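Since every corpus in the set needs its own pair of lines, generating them can avoid copy-and-paste mistakes. The sketch below builds the ALIGNSTRUCT and ALIGNED lines for each corpus, listing all the other corpora in ALIGNED. The function alignment_config and the corpus IDs are hypothetical; the output still has to be added to each configuration file by hand.

```python
def alignment_config(corpus_ids, struct="align"):
    """Map each corpus ID to the two lines that align it with the others."""
    config = {}
    for corpus_id in corpus_ids:
        others = ",".join(c for c in corpus_ids if c != corpus_id)
        config[corpus_id] = f'ALIGNSTRUCT "{struct}"\nALIGNED "{others}"'
    return config


# Hypothetical corpus IDs for an English, German and Spanish set.
for corpus_id, lines in alignment_config(["dogs_en", "dogs_de", "dogs_es"]).items():
    print(f"# {corpus_id}\n{lines}\n")
```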

vertical file with M:N mapping

m:n mapping expert users

since manatee 2.67

Starting with Manatee version 2.67, m:n (incl. m:0) alignment is supported. The name of the alignment structure should be defined in the ALIGNSTRUCT attribute in the corpus configuration file (registry). It defaults to align.

Data preparation

To use the m:n mapping, a file with mapping definition for each pair of corpora has to be prepared. The file consists of two tab-separated columns, each containing one of the following:

A,B – two non-negative integers A and B separated by a comma, denoting a range of aligned structures from A (inclusive) to B (inclusive)
A – one non-negative integer A denoting a single aligned structure
-1 – denoting an empty alignment
A:B – two non-negative integers separated by a colon denoting a range of consecutive aligned structures aligned 1:1; the corresponding range must exist in the second column

Numbers A and B (structure IDs) refer to the order in which aligned structures appear in the vertical file and have been indexed. To get the structure IDs, e.g. from a unique attribute of the structure, the str2id() method of the PosAttr object can be used. See the example below for the corpus "test", aligned structure "s" and its attribute "id":

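import manatee
c = manatee.Corpus("test")    # open the corpus "test"
a = c.get_attr("s.id")        # attribute "id" of the structure "s"
struct_id = a.str2id("")      # pass the attribute value to look up between the quotes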

A sample mapping file may look as follows: (structures are numbered starting with zero)

 0	 0	(the 1st structure of corpus 1 is aligned to the 1st structure of corpus 2)
  1	 1,3	(the 2nd structure in corpus 1 is aligned to both the 2nd and 4th structure of corpus 2)
-1	 2,4	(the 3rd and 5th structures in corpus 2 are not aligned to any structure in corpus 1)
 2:4	 5	(the 3rd, 4th and 5th structures in corpus 1 are all aligned to the 6th structure in corpus 2)
 5	-1	(the 6th structure in corpus 1 is not aligned to any structure in corpus 2)
 6,8	 6,8	(the 7th structure and 9th structure in corpus 1 are both aligned to the 7th and 9th structures in corpus 2)
 7	 7	(the 8th structure in corpus 1 is aligned to the 8th structure in corpus 2)
 9:11    9:11	(the 10th structure is aligned to the 10th structure, 11th to 11th and 12th to 12th)

The colon compresses consecutive lines with 1:1 alignment. This:

9:11   9:11

is the abbreviation for this:

9    9
10   10
11   11
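The colon abbreviation can be expanded mechanically, for example to inspect a mapping file line by line. The sketch below expands only lines that have a colon range in both columns and passes all other lines through unchanged. The function expand_colon_ranges is a hypothetical helper, not part of Sketch Engine or Manatee.

```python
def expand_colon_ranges(lines):
    """Expand 'A:B<TAB>C:D' lines into one 1:1 line per structure."""
    expanded = []
    for line in lines:
        left, right = line.split("\t")
        if ":" in left and ":" in right:
            l_start, l_end = (int(n) for n in left.split(":"))
            r_start, r_end = (int(n) for n in right.split(":"))
            # Both ranges must cover the same number of structures.
            if l_end - l_start != r_end - r_start:
                raise ValueError(f"ranges differ in length: {line!r}")
            for offset in range(l_end - l_start + 1):
                expanded.append(f"{l_start + offset}\t{r_start + offset}")
        else:
            # Lines without a colon range in both columns are kept as they are.
            expanded.append(line)
    return expanded


print(expand_colon_ranges(["7\t7", "9:11\t9:11"]))
```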

Also, note that all structures in both corpora must be covered by the mapping.

Changes in corpus configuration for m:n mapping

First, you need to set ALIGNSTRUCT to your mapping structure (if it is not "align"), e.g.:

ALIGNSTRUCT "s"

Then you define which corpora are aligned with this corpus:

ALIGNED "aligned_corpus_id_1,aligned_corpus_id_2"

And finally, you provide a mapping definition file for each of these corpora:

ALIGNDEF "/path/to/mapping/file/for/aligned_corpus_id_1,/path/to/mapping/file/for/aligned_corpus_id_2"

ALIGNED and ALIGNDEF must contain the same number of comma-delimited items in the right order (the first item in ALIGNDEF is the definition file with mapping to the first corpus in ALIGNED etc.).
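A mismatch between the two lists is easy to overlook, so a quick check before compiling may be worthwhile. The sketch below reads both options from the text of a configuration file and pairs each aligned corpus with its mapping file. It assumes each option is written on a single line as NAME "value", as in the examples above; the function names, corpus IDs and paths are hypothetical.

```python
import re


def parse_list_option(config_text, name):
    """Return the comma-separated items of a NAME "a,b" line, or [] if absent."""
    match = re.search(rf'^\s*{re.escape(name)}\s+"([^"]*)"', config_text, re.MULTILINE)
    if not match:
        return []
    return [item for item in match.group(1).split(",") if item]


def pair_aligned_with_defs(config_text):
    """Pair each corpus in ALIGNED with the mapping file at the same position."""
    aligned = parse_list_option(config_text, "ALIGNED")
    aligndef = parse_list_option(config_text, "ALIGNDEF")
    if len(aligned) != len(aligndef):
        raise ValueError(
            f"ALIGNED has {len(aligned)} items but ALIGNDEF has {len(aligndef)}"
        )
    return dict(zip(aligned, aligndef))


# Hypothetical configuration fragment.
sample_config = '''ALIGNSTRUCT "s"
ALIGNED "dogs_de,dogs_es"
ALIGNDEF "/data/maps/en_de.txt,/data/maps/en_es.txt"
'''

print(pair_aligned_with_defs(sample_config))
```

The check confirms only that the counts match; whether each file really describes the mapping to the corpus at the same position still has to be verified by the user.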

Compilation of corpora with m:n mapping

If you set ALIGNED and ALIGNDEF properly, compilecorp will compile all necessary indices for you. Alternatively, you may manually compile the index for each pair of aligned corpora by running:

where

says where the new index is going to be built and should be set to the

It is advisable to compile all aligned corpora first with --no-align and then again without this parameter.

Helper scripts

These helper scripts can be useful when creating the alignment definition files.

download
