OpenSubtitles: multilingual corpora in 58 languages
The OpenSubtitles parallel corpora 2018 are a collection of parallel corpora built from translated movie subtitles from https://www.opensubtitles.org/. The collection consists of 60 corpora in 58 languages: Chinese is represented by two corpora, one per character standard (Chinese Simplified and Chinese Traditional), and Portuguese by two corpora, one per language variety (European Portuguese and Brazilian Portuguese).
The OpenSubtitles corpora cover the following languages: Afrikaans, Albanian, Arabic, Armenian, Basque, Bengali, Bosnian, Breton, Bulgarian, Catalan, Chinese (simplified characters and traditional characters), Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Finnish, French, Galician, Georgian, German, Greek, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Kazakh, Korean, Latvian, Lithuanian, Macedonian, Malay, Malayalam, Norwegian, Persian (Farsi), Polish, Portuguese (Brazilian and European), Romanian, Russian, Serbian, Sinhalese, Slovak, Slovenian, Spanish, Swedish, Tagalog, Tamil, Telugu, Thai, Turkish, Ukrainian, Urdu and Vietnamese.
The data were obtained from the OPUS project, which is maintained by Jörg Tiedemann. We lemmatize and part-of-speech tag the texts, and provide word sketches and term grammars for them.
The OpenSubtitles parallel corpora are sentence-aligned, so you can search and analyze them monolingually (as a standard single corpus) or multilingually (as parallel corpora).
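As a rough illustration of what sentence alignment makes possible, the Python sketch below searches a handful of aligned sentence pairs both ways. The sentence pairs are invented for illustration and are not taken from the corpus, and the function names `monolingual_search` and `parallel_concordance` are hypothetical; they do not correspond to any Sketch Engine or OPUS API.

```python
import re

# Invented English-Spanish sentence pairs standing in for aligned subtitle lines.
ALIGNED_PAIRS = [
    ("Where is the car?", "¿Dónde está el coche?"),
    ("I love this car.", "Me encanta este coche."),
    ("Close the door.", "Cierra la puerta."),
]


def _word_pattern(word):
    """Compile a case-insensitive whole-word pattern for `word`."""
    return re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)


def monolingual_search(pairs, word):
    """Return source-side sentences containing `word` (single-corpus view)."""
    pattern = _word_pattern(word)
    return [src for src, _ in pairs if pattern.search(src)]


def parallel_concordance(pairs, word):
    """Return (source, target) pairs whose source side contains `word`."""
    pattern = _word_pattern(word)
    return [(src, tgt) for src, tgt in pairs if pattern.search(src)]


if __name__ == "__main__":
    # Print each hit together with its aligned translation.
    for src, tgt in parallel_concordance(ALIGNED_PAIRS, "car"):
        print(f"{src}\t{tgt}")
```

The only difference between the two views is whether the aligned translation is returned with each hit, which is why the same data can serve as a single corpus or as a parallel one.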
Tools to work with the OpenSubtitles parallel corpora
A complete set of tools is available to work with the multilingual corpora from OpenSubtitles.org to generate:
- parallel concordance – examples of translations in context
- word sketch – collocations categorized by grammatical relations
- thesaurus – synonyms and similar words for every word
- keywords – terminology extraction of one-word and multi-word units
- word lists – lists of nouns, verbs, adjectives, etc. organized by frequency
- n-grams – frequency list of multi-word units
- concordance – examples in context
- text type analysis – statistics of metadata in the corpus
The set of tools may vary depending on the particular language.
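To make the idea of an n-gram frequency list concrete, here is a minimal Python sketch. It assumes plain whitespace tokenization and lowercasing, and it uses invented sample sentences; it is not how Sketch Engine computes n-grams, and the function name `ngram_frequencies` is hypothetical.

```python
from collections import Counter


def ngram_frequencies(sentences, n=2):
    """Count n-grams of lowercased, whitespace-separated tokens.

    N-grams are counted within each sentence and never span two sentences.
    """
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


if __name__ == "__main__":
    # Invented sample sentences, not taken from the corpus.
    sample = ["I do not know", "I do not care", "you do not know"]
    for ngram, freq in ngram_frequencies(sample, n=2).most_common(3):
        print(" ".join(ngram), freq)
```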
OpenSubtitles parallel corpora – statistics
The table below shows the number of sentence pairs aligned in each language pair of the OpenSubtitles parallel corpora. Rows give the first language of the pair and columns the second. For example, for Afrikaans (af) and Arabic (ar), about 12,000 sentences are aligned in the direction Afrikaans–Arabic (2nd row, 6th column) and about 12,300 in the opposite direction Arabic–Afrikaans (3rd row, 5th column).
The 2nd column (files), the 3rd column (tokens), and the 4th column (sentences) give the total number of files, tokens, and sentences for each language, i.e. the size of the corpus for that single language.
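Because the counts depend on direction, the table is best read as a lookup keyed by an ordered language pair. The Python sketch below shows this using only the two approximate figures quoted above; the dictionary layout and the function name `aligned_count` are assumptions made for illustration, not a format in which the data is distributed.

```python
# Approximate counts quoted in the text; keys are (row language, column language).
ALIGNED_SENTENCES = {
    ("af", "ar"): 12_000,  # Afrikaans -> Arabic
    ("ar", "af"): 12_300,  # Arabic -> Afrikaans
}


def aligned_count(table, source, target):
    """Return the sentence count for the direction source -> target, or None if absent."""
    return table.get((source, target))


if __name__ == "__main__":
    print(aligned_count(ALIGNED_SENTENCES, "af", "ar"))
    print(aligned_count(ALIGNED_SENTENCES, "ar", "af"))
```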
Bibliography & citation
Jörg Tiedemann, 2012, Parallel Data, Tools and Interfaces in OPUS. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012).
Pierre Lison and Jörg Tiedemann, 2016, OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016).
Search the OpenSubtitles parallel corpora
Sketch Engine offers a range of tools to work with the OpenSubtitles parallel corpora.
Learn to work with multilingual and parallel corpora in Sketch Engine. Find more in our user guide.