Data format for a parallel corpus

This page describes the data format required for building a parallel corpus from an Excel spreadsheet using the basic method.
Metadata will be converted to text types and will appear in the Text type selector. Users can use the selector to restrict analysis to data that fall within the selected text types. Statistics on text types can also be generated.

Data format

The data must be supplied in a tabular format like this:

One column per language: the spreadsheet must contain as many columns as there are languages (2 columns for 2 languages, 3 columns for 3 languages, etc.). All other columns must stay empty.
Row 1 must contain the English names of the languages.
From row 2 onward, the cells must contain the corpus text (the aligned segments).
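
The layout rules above can be checked before upload with a short script. This is only a sketch: it assumes the spreadsheet has already been read into a list of rows (how you load the Excel file is left open), and the function name validate_parallel_rows is hypothetical, not part of any Sketch Engine tool.

```python
def validate_parallel_rows(rows, n_languages):
    """Return a list of problems found in a table of aligned segments.

    rows: list of rows, each a list of cell values (None or "" = empty cell).
    n_languages: number of languages, i.e. number of columns in use.
    """
    if not rows:
        return ["table is empty"]

    def is_empty(cell):
        return cell is None or str(cell).strip() == ""

    problems = []
    for row_no, row in enumerate(rows, start=1):
        # Pad short rows so every row has at least n_languages cells.
        cells = list(row) + [None] * (n_languages - len(row))
        # Columns beyond the language columns must stay empty.
        if any(not is_empty(c) for c in cells[n_languages:]):
            problems.append(f"row {row_no}: text outside the language columns")
        # Row 1 holds the English language names, one per column.
        if row_no == 1 and any(is_empty(c) for c in cells[:n_languages]):
            problems.append("row 1: missing language name")
    return problems


good = [["English", "German"], ["Hello.", "Hallo."]]
bad = [["English", ""], ["Hello.", "Hallo.", "note"]]
good_problems = validate_parallel_rows(good, 2)
bad_problems = validate_parallel_rows(bad, 2)
```

Here good_problems is empty, while bad_problems reports the missing language name in row 1 and the stray third column in row 2.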

If a long L1 (language 1) segment was translated as two short L2 (language 2) segments, cell A should contain one segment and the corresponding cell B two segments. If two L1 segments were translated as one L2 segment, cell A will contain two segments and cell B one segment.
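
As an illustration of one-to-two and two-to-one alignments, the sketch below builds spreadsheet rows from lists of segments. The helper alignment_to_row and the sample sentences are invented for this example, and joining segments inside a cell with a single space is an assumption; this page does not prescribe a separator.

```python
def alignment_to_row(l1_segments, l2_segments):
    """Join the segments on each side so one alignment unit fills one row."""
    # Assumption: segments sharing a cell are separated by a single space.
    return [" ".join(l1_segments), " ".join(l2_segments)]


rows = [
    ["English", "German"],
    # One L1 segment translated as two L2 segments: both go into cell B.
    alignment_to_row(
        ["It was late, so we went home."],
        ["Es war spät.", "Wir gingen nach Hause."],
    ),
    # Two L1 segments translated as one L2 segment: both go into cell A.
    alignment_to_row(
        ["It rained.", "We stayed in."],
        ["Es regnete, also blieben wir zu Hause."],
    ),
]
```

Each alignment unit ends up in exactly one row, so the columns stay aligned regardless of how many sentences a translator merged or split.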

The word ‘segment’ usually refers to a sentence, but a corpus can also be aligned at a different level, for example by short paragraphs.

Example

This is an example of data containing only text without any metadata.

text only

Corpus with text types

If your corpus contains text types (metadata), they must be placed in the same columns as the text. Additional columns are not allowed.

Insert metadata into all languages

To analyse both languages using metadata, the metadata must be inserted in all languages (columns).

Structure tags

Metadata must be included in structure tags which surround the corresponding stretch of text. The names of the structures are up to the corpus author, for example seg or segment or any other name. Avoid using s, which is reserved for sentences.

Structure tags can be inserted into the same cell as the text or into the cells directly above and below the text. Either option will be processed correctly. Download example data on the right.
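
The two tag placements can be sketched as follows. The example assumes XML-like tags of the form <name attribute="value"> ... </name>; the structure name doc, the attribute genre and both helper functions are hypothetical. Note that the tags are written into every language column, in line with the requirement to insert metadata in all languages.

```python
def wrap_inline(rows, open_tag, close_tag):
    """Put the tags into the first and last text cells of every column."""
    rows = [list(r) for r in rows]  # copy so the input is not modified
    rows[0] = [f"{open_tag}{cell}" for cell in rows[0]]
    rows[-1] = [f"{cell}{close_tag}" for cell in rows[-1]]
    return rows


def wrap_separate(rows, open_tag, close_tag):
    """Put the tags into their own cells above and below the text."""
    n_cols = len(rows[0])
    return [[open_tag] * n_cols] + [list(r) for r in rows] + [[close_tag] * n_cols]


# Text rows only; the header row with language names is added separately.
text_rows = [["Hello.", "Hallo."], ["Goodbye.", "Tschüss."]]
opening, closing = '<doc genre="dialogue">', "</doc>"

inline = wrap_inline(text_rows, opening, closing)
separate = wrap_separate(text_rows, opening, closing)
```

With wrap_inline the table keeps its two rows and the tags share cells with the text; with wrap_separate it grows to four rows, with the opening and closing tags alone in the first and last rows.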

Metadata format

There is no limit to the number of different structures, attributes and values. It is up to the corpus author to choose their names. See Annotating corpus text for details about the attribute and value format.
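
Because structures can carry any number of attributes, it may be convenient to generate opening tags from a dictionary. The sketch below assumes the XML-like name="value" syntax; the function open_tag and the sample attributes are hypothetical, and Annotating corpus text remains the authoritative description of the format.

```python
def open_tag(structure, attributes):
    """Build an opening structure tag from a name and a dict of attributes."""
    if structure == "s":
        raise ValueError("'s' is reserved for sentences")
    parts = [structure]
    for name, value in attributes.items():
        value = str(value)
        # Assumption: values are wrapped in double quotes, so a double
        # quote inside a value is rejected instead of being escaped.
        if '"' in value:
            raise ValueError(f"double quote in value of {name!r}")
        parts.append(f'{name}="{value}"')
    return "<" + " ".join(parts) + ">"


tag = open_tag("doc", {"genre": "news", "year": 2020})
```

The resulting string can be placed in the spreadsheet cells of each language column, paired with a matching closing tag such as </doc>.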

Example

These examples show corpus text annotated with metadata.

with metadata (simple)

with metadata (complex)
