UNPC: The United Nations Parallel Corpus
The United Nations Parallel Corpus (UNPC) is a collection of six parallel corpora made up of official records and other parliamentary documents of the United Nations. Most of the documents are available in all six official languages of the United Nations. The corpus consists of manually translated documents from 1990 to 2014, and the texts are aligned at the sentence level.
The UNPC corpora cover the following languages: Arabic, Chinese (simplified script), English, French, Russian and Spanish.
Number of aligned tokens for each pair of languages
The United Nations parallel corpora are sentence-aligned, so you can search and analyze them monolingually (as a standard single corpus) or multilingually (as parallel corpora). The data were obtained from the OPUS project, which is maintained by Jörg Tiedemann. We processed the texts with lemmatization and part-of-speech tagging, and the corpora support word sketches and term extraction. Incorrect alignments were fixed.
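To make the difference between monolingual and parallel search concrete, here is a minimal Python sketch. It is not how Sketch Engine or OPUS store or query the data. The `AlignedPair` class, the function names and the three English-French sentence pairs are all invented for illustration, and matching is plain substring search rather than lemma-based search.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignedPair:
    """One sentence-level alignment: a source sentence and its translation."""
    source: str
    target: str


# Toy, invented sample data (not taken from the UNPC).
PAIRS = [
    AlignedPair("The General Assembly adopted the resolution.",
                "L'Assemblée générale a adopté la résolution."),
    AlignedPair("The Security Council met on Monday.",
                "Le Conseil de sécurité s'est réuni lundi."),
    AlignedPair("The resolution was adopted without a vote.",
                "La résolution a été adoptée sans vote."),
]


def search_monolingual(pairs, term):
    """Treat the source side as a single corpus: return matching sentences."""
    needle = term.lower()
    return [p.source for p in pairs if needle in p.source.lower()]


def search_parallel(pairs, source_term, target_term=None):
    """Return aligned pairs whose source contains source_term and,
    if target_term is given, whose target also contains target_term."""
    src = source_term.lower()
    hits = [p for p in pairs if src in p.source.lower()]
    if target_term is not None:
        tgt = target_term.lower()
        hits = [p for p in hits if tgt in p.target.lower()]
    return hits


if __name__ == "__main__":
    for sentence in search_monolingual(PAIRS, "resolution"):
        print(sentence)
    for pair in search_parallel(PAIRS, "resolution", "adoptée"):
        print(pair.source, "|", pair.target)
```

Because every hit keeps its aligned counterpart, the parallel search can show the translation next to each match or filter on it, which a monolingual search cannot do.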
Tools to work with the United Nations Parallel Corpus
A complete set of tools is available to work with the multilingual UNPC corpora to generate:
- word sketch – collocations categorized by grammatical relations
- thesaurus – synonyms and similar words for every word
- keywords – terminology extraction of one-word and multi-word units
- word lists – lists of nouns, verbs, adjectives, etc. organized by frequency
- n-grams – frequency list of multi-word units
- concordance – examples in context
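As a rough illustration of what three of the outputs listed above contain (word lists, n-grams and concordance), here is a small Python sketch using only the standard library. It is not Sketch Engine's implementation: the tokenizer is a naive regular expression, there is no lemmatization or part-of-speech tagging, and the sample text and function names are invented for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Naive tokenizer: lowercase runs of word characters."""
    return re.findall(r"\w+", text.lower())


def ngram_counts(tokens, n):
    """Frequency list of n-grams (multi-word units of length n)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def concordance(tokens, keyword, window=3):
    """Keyword-in-context lines as (left context, keyword, right context)."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines


# Invented sample text (not taken from the UNPC).
TEXT = ("The General Assembly adopted the resolution. "
        "The General Assembly then adjourned.")

tokens = tokenize(TEXT)
word_list = Counter(tokens).most_common()   # words organized by frequency
bigrams = ngram_counts(tokens, 2)           # two-word units with counts

if __name__ == "__main__":
    print(word_list[:3])
    print(bigrams.most_common(2))
    for left, kw, right in concordance(tokens, "resolution", window=2):
        print(f"{left} [{kw}] {right}")
```

Word sketches, the thesaurus and keyword extraction depend on grammatical annotation and reference corpora, so they are not sketched here.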
Bibliography & citation
Jörg Tiedemann, 2012, Parallel Data, Tools and Interfaces in OPUS. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC’2012).
Pierre Lison and Jörg Tiedemann, 2016, OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC-2016).
The following disclaimer, an integral part of the United Nations Parallel Corpus, shall be respected with regard to the Corpus (no other restrictions apply):
- The United Nations Parallel Corpus is made available without warranty of any kind, explicit or implied. The United Nations specifically makes no warranties or representations as to the accuracy or completeness of the information contained in the United Nations Corpus.
- Under no circumstances shall the United Nations be liable for any loss, liability, injury or damage incurred or suffered that is claimed to have resulted from the use of the United Nations Corpus. The use of the United Nations Corpus is at the user’s sole risk. The user specifically acknowledges and agrees that the United Nations is not liable for the conduct of any user. If the user is dissatisfied with any of the material provided in the United Nations Corpus, the user’s sole and exclusive remedy is to discontinue using the United Nations Corpus.
- When using the United Nations Corpus, the user must acknowledge the United Nations as the source of the information. For references, please cite this reference: Ziemski, M., Junczys-Dowmunt, M., and Pouliquen, B., (2016), The United Nations Parallel Corpus, Language Resources and Evaluation (LREC’16), Portorož, Slovenia, May 2016.
- Nothing herein shall constitute or be considered to be a limitation upon or waiver, express or implied, of the privileges and immunities of the United Nations, which are specifically reserved.
Search the United Nations Parallel Corpus
Sketch Engine offers a range of tools to work with the UNPC parallel corpora.
Learn to work with multilingual and parallel corpora in Sketch Engine. Find more in our user guide.