Parallel corpora in Sketch Engine
Parallel corpora are multilingual corpora made from translated texts. Finding parallel data suitable for corpora is extremely difficult. The largest volume of translated text consists of contracts and other legal documents, which are often confidential. Software localization also produces a lot of translated text, but it is rarely useful for corpora because it contains highly specialized language rather than common natural language. This is why the parallel corpora in Sketch Engine reflect what is available rather than what Sketch Engine would ideally like to have.
Texts produced by the EU
The European Union produces an enormous amount of text and most of it must be translated into all the official languages of the EU. The texts are often made publicly available, although not in the format of a parallel concordance. Sketch Engine collected these texts, aligned them and produced these parallel corpora:
This is an enormous corpus of documents covering a wide range of topics. Although the language is formal and legal in character, the vocabulary ranges from cars to shrimps and from carrots to pneumatic hammers. It is therefore a good starting point for multilingual reference.
This is a more specialized corpus containing judgements of the Court of Justice of the European Union. https://www.sketchengine.eu/eurlex-judgments-corpus/
This is a corpus of spoken language used in the European Parliament. It even contains informal language from moments when the MEPs get a bit too excited.
The European Commission’s DGT (Directorate-General for Translation) made its multilingual Translation Memory available for download and Sketch Engine processed it into a parallel corpus.
The situation with languages which are not official languages of the European Union is very complicated. Although the governments of most countries and regions translate documents into other languages, these documents are generally not publicly available.
The OPUS2 corpus is currently the only source for non-EU languages and their combinations with many other languages. Refer to the project website for details. The current OPUS corpus in Sketch Engine dates back to 2013 but an up-to-date corpus with a lot more data and covering many more languages will be released in the next couple of months (written in February 2021).