New interface
Until bilingual term extraction becomes part of Sketch Engine, it is available through OneClick Terms, a dedicated term extraction interface.
Bilingual term extraction in detail
Bilingual term extraction is an extension of term extraction. It is available through a dedicated interface.
Data requirements
Parallel texts aligned at the paragraph or sentence level are needed. Upload your translation memory as a TMX file and Sketch Engine will process it automatically and convert it into aligned corpora. Aligned texts in more than two languages can be uploaded.
Extracting terminology step by step
- Click Start here and select the Two Languages option
- Upload a file with parallel data (supported formats: TMX, XLIFF 2.0+, XLF 2.0+, XLS* and XLSX*)
- Set up specific parameters via More settings if needed
- Click Process Data & Extract Bilingual Terminology >
- Check the automatically detected source language and target language(s)
- Click Extract Bilingual Terminology
*The first row of the spreadsheet must contain the English names of the languages. The other rows should contain the aligned segments (e.g. sentences, paragraphs), side by side. Each column should only contain data for one language.
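The spreadsheet layout described in the note above can be illustrated with a short Python sketch. It reads aligned segments from a small TMX sample and arranges them as rows: language names first, then one aligned segment pair per row. This is only an illustration of the expected layout, not how Sketch Engine processes uploads. The function name tmx_to_rows, the LANGUAGE_NAMES mapping and the sample sentences are hypothetical; writing the rows to an actual XLS or XLSX file is left out.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical mapping from language codes to English language names
# (assumption: extend it for the languages in your own data).
LANGUAGE_NAMES = {"en": "English", "fr": "French"}

# Made-up sample data, for illustration only.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en" datatype="plaintext" segtype="sentence"
          adminlang="en" creationtool="example" creationtoolversion="1"
          o-tmf="none"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>The central bank sets interest rates.</seg></tuv>
      <tuv xml:lang="fr"><seg>La banque centrale fixe les taux.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Price stability is the main goal.</seg></tuv>
      <tuv xml:lang="fr"><seg>La stabilité des prix est le but principal.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def tmx_to_rows(tmx_text, lang_codes):
    """Return spreadsheet-style rows: a header of language names,
    then one row per translation unit with one column per language."""
    root = ET.fromstring(tmx_text)
    rows = [[LANGUAGE_NAMES[code] for code in lang_codes]]
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.iter("tuv"):
            # Normalise codes such as "en-GB" to "en".
            code = tuv.get(XML_LANG, "").lower().split("-")[0]
            seg = tuv.find("seg")
            if seg is not None:
                segments[code] = "".join(seg.itertext()).strip()
        # Keep only units that have a segment in every requested language.
        if all(code in segments for code in lang_codes):
            rows.append([segments[code] for code in lang_codes])
    return rows


rows = tmx_to_rows(SAMPLE_TMX, ["en", "fr"])
for row in rows:
    print(" | ".join(row))
```

Each column holds data for one language only, and each row after the header holds one aligned segment pair.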
The bilingual terms can be saved as TBX by clicking the Download TBX button.
Example
See below an example of bilingual terms extracted from the European Central Bank corpus.
Notes on sorting
In our experience with large data (74-million-token DGT), sorting candidates by co-occurrence frequency yields better results. The sorting can be changed in the interface by clicking the column headers. The results are a good starting point when preparing a translation termbase from scratch.
Another example is a small English-French corpus of UNICEF-related texts. Here, sorting the extracted terms by logDice (a co-occurrence statistic) works better than in the previous example.
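To make the difference between the two sort orders concrete, here is a minimal Python sketch. It uses the standard logDice formula, 14 + log2(2 * f_xy / (f_x + f_y)). Which counts the bilingual extraction feeds into the formula is an assumption here: f_xy is taken as the number of aligned segments in which both terms occur, and f_x and f_y as the number of segments containing each term. The candidate pairs and their counts are invented for illustration.

```python
import math


def log_dice(f_xy, f_x, f_y):
    """Standard logDice score: 14 + log2(2 * f_xy / (f_x + f_y)).

    Assumption: f_xy counts aligned segments containing both terms,
    f_x and f_y count segments containing each term separately.
    """
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


# Invented candidates: (source term, target term, f_xy, f_x, f_y).
candidates = [
    ("interest rate", "taux d'intérêt", 120, 150, 140),
    ("central bank", "banque", 200, 220, 900),
    ("price stability", "stabilité des prix", 40, 42, 41),
]

# Sort by raw co-occurrence frequency, highest first.
by_frequency = sorted(candidates, key=lambda c: c[2], reverse=True)

# Sort by logDice, highest first.
by_logdice = sorted(candidates, key=lambda c: log_dice(c[2], c[3], c[4]),
                    reverse=True)

for source, target, f_xy, f_x, f_y in by_logdice:
    print(f"{source} = {target}: logDice {log_dice(f_xy, f_x, f_y):.2f}, "
          f"co-occurrences {f_xy}")
```

In this invented data, frequency sorting puts the most common pair first even though the target term also occurs in many other segments, while logDice favours pairs that occur almost exclusively together.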
References
Bilingual Terminology Extraction in Sketch Engine. Vít Baisa, Barbora Ulipová, and Michal Cukr. In Ninth Workshop on Recent Advances in Slavonic Natural Language Processing, the Czech Republic, December 2015, pp. 61–67.