United Nations Parallel Corpus (UNPC)

UNPC: The United Nations Parallel Corpus

The United Nations Parallel Corpus (UNPC) is a collection of six parallel corpora composed of official records and other parliamentary documents of the United Nations. Most of the documents are available in all six official languages of the United Nations. The corpus consists of manually translated documents from 1990 to 2014, and the texts are aligned at the sentence level.

The UNPC corpora cover six languages: Arabic, Chinese (Simplified script), English, French, Russian and Spanish.

Number of aligned tokens for each pair of languages

	Arabic	Chinese (Simplified)	English	French	Russian	Spanish
Arabic	-	517,150,223	587,231,539	596,881,951	600,601,791	601,362,359
Chinese (Simplified)	434,987,326	-	428,869,617	434,499,039	440,768,507	435,260,238
English	550,685,401	475,126,683	-	753,658,350	644,825,629	657,633,158
French	678,835,378	586,645,545	904,326,817	-	787,712,855	807,972,655
Russian	537,398,056	468,634,178	615,852,162	621,182,081	-	565,543,530
Spanish	637,090,563	547,013,982	739,463,681	753,873,421	668,790,072	-

The United Nations parallel corpora are sentence-aligned, so you can search and analyze them monolingually (as a standard single corpus) or multilingually (as parallel corpora). The data were obtained from the OPUS project, which is maintained by Jörg Tiedemann. We lemmatize and part-of-speech tag the texts, which supports word sketches and term extraction. Incorrect alignments were fixed.
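To illustrate what sentence-level alignment makes possible, here is a minimal Python sketch. It is not Sketch Engine's API and does not read the real UNPC data: the `ALIGNED` list, the sample sentences and both function names are hypothetical, chosen only to show the difference between a monolingual search and a parallel search over aligned sentence pairs.

```python
from typing import List, Tuple

# Hypothetical toy data: each tuple is one aligned sentence pair (English, French).
ALIGNED: List[Tuple[str, str]] = [
    ("The General Assembly adopted the resolution.",
     "L'Assemblée générale a adopté la résolution."),
    ("The Security Council met on Monday.",
     "Le Conseil de sécurité s'est réuni lundi."),
    ("The resolution was adopted without a vote.",
     "La résolution a été adoptée sans vote."),
]


def search_monolingual(pairs: List[Tuple[str, str]], query: str) -> List[str]:
    """Treat the source side as a single corpus: return matching sentences only."""
    q = query.lower()
    return [src for src, _ in pairs if q in src.lower()]


def search_parallel(pairs: List[Tuple[str, str]], query: str) -> List[Tuple[str, str]]:
    """Search the source side and return each hit together with its aligned translation."""
    q = query.lower()
    return [(src, tgt) for src, tgt in pairs if q in src.lower()]


if __name__ == "__main__":
    for src, tgt in search_parallel(ALIGNED, "resolution"):
        print(src, "|", tgt)
```

Because every sentence carries its aligned counterpart, the parallel search needs no extra lookup step: a hit in one language directly yields the translation in the other.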

Tools to work with the United Nations Parallel Corpus

A complete set of tools is available to work with the multilingual UNPC corpora to generate:

word sketch – collocations categorized by grammatical relations
thesaurus – synonyms and similar words for every word
keywords – terminology extraction of one-word and multi-word units
word lists – lists of nouns, verbs, adjectives, etc. organized by frequency
n-grams – frequency list of multi-word units
concordance – examples in context
text type analysis – statistics of metadata in the corpus
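As a rough illustration of what three of these outputs contain (word lists, n-grams and concordances), the following Python sketch computes them on a short sample string. It is a simplified stand-in under stated assumptions, not how Sketch Engine implements these tools: the tokenizer is a plain regular expression, there is no lemmatization or part-of-speech tagging, and the names `tokenize`, `ngram_frequencies`, `concordance` and `SAMPLE` are hypothetical.

```python
import re
from collections import Counter
from typing import List, Tuple

# Hypothetical sample text; the real corpus is far larger and is lemmatized and tagged.
SAMPLE = ("The General Assembly adopted the resolution and "
          "the General Assembly noted the report")


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens (a naive tokenizer)."""
    return re.findall(r"\w+", text.lower())


def ngram_frequencies(tokens: List[str], n: int) -> Counter:
    """Count every run of n consecutive tokens; n=1 gives a plain word list."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


def concordance(tokens: List[str], keyword: str, width: int = 3) -> List[Tuple[str, str, str]]:
    """Return (left context, keyword, right context) for each occurrence of keyword."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


if __name__ == "__main__":
    tokens = tokenize(SAMPLE)
    print(ngram_frequencies(tokens, 1).most_common(3))  # word list by frequency
    print(ngram_frequencies(tokens, 2).most_common(3))  # bigram frequency list
    print(concordance(tokens, "resolution", width=2))   # keyword in context
```

Word sketches, the thesaurus and keyword extraction depend on grammatical annotation and reference corpora, so they are not covered by this sketch.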

Bibliography & citation

Ziemski, M., Junczys-Dowmunt, M., and Pouliquen, B., (2016), The United Nations Parallel Corpus, Language Resources and Evaluation (LREC’16), Portorož, Slovenia, May 2016

Disclaimer

The following disclaimer, an integral part of the United Nations Parallel Corpus, shall be respected with regard to the Corpus (no other restrictions apply):

The United Nations Parallel Corpus is made available without warranty of any kind, explicit or implied. The United Nations specifically makes no warranties or representations as to the accuracy or completeness of the information contained in the United Nations Corpus.
Under no circumstances shall the United Nations be liable for any loss, liability, injury or damage incurred or suffered that is claimed to have resulted from the use of the United Nations Corpus. The use of the United Nations Corpus is at the user’s sole risk. The user specifically acknowledges and agrees that the United Nations is not liable for the conduct of any user. If the user is dissatisfied with any of the material provided in the United Nations Corpus, the user’s sole and exclusive remedy is to discontinue using the United Nations Corpus.
When using the United Nations Corpus, the user must acknowledge the United Nations as the source of the information. For references, please cite this reference: Ziemski, M., Junczys-Dowmunt, M., and Pouliquen, B., (2016), The United Nations Parallel Corpus, Language Resources and Evaluation (LREC’16), Portorož, Slovenia, May 2016.
Nothing herein shall constitute or be considered to be a limitation upon or waiver, express or implied, of the privileges and immunities of the United Nations, which are specifically reserved.

Search the United Nations Parallel Corpus

Sketch Engine offers a range of tools to work with the UNPC parallel corpora.


Tip

Learn to work with multilingual and parallel corpora in Sketch Engine. Find more in our user guide.

More parallel corpora

DGT Translation Memory parallel corpora – European Union’s legislative documents

EUR-Lex 2/2016 parallel corpora – texts from the EUR-Lex database containing public EU documents

Eur-Lex judgments 12/2016 parallel corpora – judgments of the Court of Justice of the European Union

Europarl spoken parallel corpora – transcriptions of the European Parliament Proceedings

Open Parallel Corpus (OPUS) – translated texts from various sources, e.g. medical documents, subtitles, technical documentation, etc.

OpenSubtitles 2018 parallel corpora – movie subtitles from the OpenSubtitles database


Use Sketch Engine in minutes

Generate collocations, frequency lists, examples in context and n-grams, or extract terms. Use our Quick Start Guide to learn how in minutes.

