UNPC: The United Nations Parallel Corpus

The United Nations Parallel Corpus (UNPC) is a collection of six parallel corpora compiled from official records and other parliamentary documents of the United Nations. Most of the documents are available in all six official languages of the United Nations. The corpus consists of manually translated documents dated between 1990 and 2014, and the texts are aligned at the sentence level.

The UNPC corpora cover the following languages: Arabic, Chinese (Simplified script), English, French, Russian and Spanish.

Number of aligned tokens for each pair of languages

| | Arabic | Chinese (Simplified) | English | French | Russian | Spanish |
|---|---|---|---|---|---|---|
| Arabic | - | 517,150,223 | 587,231,539 | 596,881,951 | 600,601,791 | 601,362,359 |
| Chinese (Simplified) | 434,987,326 | - | 428,869,617 | 434,499,039 | 440,768,507 | 435,260,238 |
| English | 550,685,401 | 475,126,683 | - | 753,658,350 | 644,825,629 | 657,633,158 |
| French | 678,835,378 | 586,645,545 | 904,326,817 | - | 787,712,855 | 807,972,655 |
| Russian | 537,398,056 | 468,634,178 | 615,852,162 | 621,182,081 | - | 565,543,530 |
| Spanish | 637,090,563 | 547,013,982 | 739,463,681 | 753,873,421 | 668,790,072 | - |
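The table above is not symmetric: the cell for row A and column B generally differs from the cell for row B and column A. A plausible reading, which this page does not state explicitly, is that each cell counts the tokens on the row-language side of the pair. The following Python sketch only restates the published figures and compares the two directions of a pair; the names `ROWS`, `tokens`, `largest_cell` and `asymmetry` are illustrative and not part of any Sketch Engine interface.

```python
# Token counts copied from the table above, indexed as ROWS[row][column].
# "Chinese" stands for "Chinese (Simplified)"; None marks the empty diagonal.
LANGS = ["Arabic", "Chinese", "English", "French", "Russian", "Spanish"]
ROWS = {
    "Arabic":  [None, 517_150_223, 587_231_539, 596_881_951, 600_601_791, 601_362_359],
    "Chinese": [434_987_326, None, 428_869_617, 434_499_039, 440_768_507, 435_260_238],
    "English": [550_685_401, 475_126_683, None, 753_658_350, 644_825_629, 657_633_158],
    "French":  [678_835_378, 586_645_545, 904_326_817, None, 787_712_855, 807_972_655],
    "Russian": [537_398_056, 468_634_178, 615_852_162, 621_182_081, None, 565_543_530],
    "Spanish": [637_090_563, 547_013_982, 739_463_681, 753_873_421, 668_790_072, None],
}


def tokens(row, col):
    """Return the table cell for the given row and column language."""
    return ROWS[row][LANGS.index(col)]


def largest_cell():
    """Return ((row, col), count) for the largest off-diagonal cell."""
    cells = [(r, c) for r in LANGS for c in LANGS if r != c]
    best = max(cells, key=lambda rc: tokens(*rc))
    return best, tokens(*best)


def asymmetry(a, b):
    """Difference between the (a, b) cell and the (b, a) cell."""
    return tokens(a, b) - tokens(b, a)


if __name__ == "__main__":
    print(largest_cell())
    print(asymmetry("French", "English"))
```

Comparing both directions of a pair before reporting a corpus size avoids quoting a figure for the wrong side of the alignment.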

The United Nations parallel corpora are sentence-aligned, so you can search and analyze them monolingually (as a standard single corpus) or multilingually (as parallel corpora). The data were obtained from the OPUS project, which is maintained by Jörg Tiedemann. We processed the texts with lemmatization and part-of-speech tagging, which also enables word sketches and term extraction. Incorrect alignments were fixed.
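To illustrate the difference between monolingual and parallel searching over sentence-aligned data, here is a minimal Python sketch. It assumes the alignment is available as a list of (source sentence, target sentence) pairs. The sample sentences are invented for illustration and are not taken from the UNPC, and the function names are hypothetical; this is not how Sketch Engine implements its search.

```python
import re
from typing import List, Tuple

# Invented toy data for illustration only (not taken from the UNPC).
# Each tuple is one aligned sentence pair: (English, French).
ALIGNED: List[Tuple[str, str]] = [
    ("The committee adopted the report.", "Le comité a adopté le rapport."),
    ("The meeting was adjourned.", "La séance est levée."),
    ("The report was submitted to the committee.", "Le rapport a été soumis au comité."),
]


def _tokens(sentence: str) -> List[str]:
    """Lowercase the sentence and split it into word tokens."""
    return re.findall(r"\w+", sentence.lower())


def monolingual_search(pairs, word, side=0):
    """Search one language only: return sentences on `side` containing `word`."""
    return [pair[side] for pair in pairs if word.lower() in _tokens(pair[side])]


def parallel_search(pairs, word):
    """Search the source side and return each hit with its aligned translation."""
    return [pair for pair in pairs if word.lower() in _tokens(pair[0])]


if __name__ == "__main__":
    for source, target in parallel_search(ALIGNED, "report"):
        print(source, "|", target)
```

The only difference between the two modes in this sketch is what is returned: a monolingual search ignores the alignment, while a parallel search uses it to show the translation next to every hit.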

Tools to work with the United Nations Parallel Corpus

A complete set of tools is available to work with the multilingual UNPC corpora to generate:

  • word sketch – collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • keywords – terminology extraction of one-word and multi-word units
  • word lists – lists of nouns, verbs, adjectives, etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
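As a rough illustration of what three of the listed outputs contain (word lists, n-grams and concordances), the Python sketch below computes them for a short invented sentence. It works on raw lowercase tokens, whereas the tools described above operate on lemmatized and part-of-speech tagged text, so it should be read as a simplified approximation under that assumption, not as Sketch Engine's method. All function names are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def word_list(tokens):
    """Frequency list of single tokens."""
    return Counter(tokens)


def ngrams(tokens, n):
    """Frequency list of n-token sequences."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def concordance(tokens, keyword, window=3):
    """Keyword-in-context lines as (left context, keyword, right context)."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines


# Invented sample text for illustration only (not taken from the UNPC).
SAMPLE = ("The General Assembly adopted the resolution. "
          "The General Assembly also adopted the report.")

if __name__ == "__main__":
    toks = tokenize(SAMPLE)
    print(word_list(toks).most_common(3))
    print(ngrams(toks, 2).most_common(3))
    for left, kw, right in concordance(toks, "adopted", window=2):
        print(f"{left:>20} [{kw}] {right}")
```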

Jörg Tiedemann, 2012, Parallel Data, Tools and Interfaces in OPUS. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012).

Pierre Lison and Jörg Tiedemann, 2016, OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016).

The following disclaimer, an integral part of the United Nations Parallel Corpus, shall be respected with regard to the Corpus (no other restrictions apply):

  • The United Nations Parallel Corpus is made available without warranty of any kind, explicit or implied. The United Nations specifically makes no warranties or representations as to the accuracy or completeness of the information contained in the United Nations Corpus.
  • Under no circumstances shall the United Nations be liable for any loss, liability, injury or damage incurred or suffered that is claimed to have resulted from the use of the United Nations Corpus. The use of the United Nations Corpus is at the user’s sole risk. The user specifically acknowledges and agrees that the United Nations is not liable for the conduct of any user. If the user is dissatisfied with any of the material provided in the United Nations Corpus, the user’s sole and exclusive remedy is to discontinue using the United Nations Corpus.
  • When using the United Nations Corpus, the user must acknowledge the United Nations as the source of the information. For references, please cite this reference: Ziemski, M., Junczys-Dowmunt, M., and Pouliquen, B., (2016), The United Nations Parallel Corpus, Language Resources and Evaluation (LREC’16), Portorož, Slovenia, May 2016.
  • Nothing herein shall constitute or be considered to be a limitation upon or waiver, express or implied, of the privileges and immunities of the United Nations, which are specifically reserved.

Search the United Nations Parallel Corpus

Sketch Engine offers a range of tools to work with the UNPC parallel corpora.

Tip

Learn to work with multilingual and parallel corpora in Sketch Engine. Find more in our user guide.

More parallel corpora

EUR-Lex Corpora – texts from the EUR-Lex database containing public EU documents

EUR-Lex judgments corpus – judgments of the Court of Justice of the European Union

OPUS 2 parallel corpora – a collection of translated texts from the web

DGT-Translation Memory corpora – European Union’s legislative documents

Europarl: European Parliament Proceedings Corpora – transcriptions of the European Parliament Proceedings
