If you are a new user, it might not be clear which corpus you should use. Here are a few tips:
Featured corpora are a good starting point if you need a monolingual corpus. They have been pre-selected based on their size and on the availability of the maximum number of features.
No featured corpus?
If there is no featured corpus in your language, switch to All, use the drop-down to select the language, and pick the largest corpus.
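The selection rule above can be written down as a short sketch. This is only an illustration of the decision logic, not part of Sketch Engine: the `Corpus` record, the `pick_corpus` function and the sample catalogue are all hypothetical, and choosing the largest corpus when several featured ones exist is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Corpus:
    # Hypothetical record describing one entry in a corpus list.
    name: str
    language: str
    words: int
    featured: bool = False


def pick_corpus(corpora, language):
    """Prefer a featured corpus; otherwise fall back to the largest one."""
    candidates = [c for c in corpora if c.language == language]
    if not candidates:
        return None  # nothing available in this language
    featured = [c for c in candidates if c.featured]
    # Assumption: among several suitable corpora, take the largest.
    pool = featured or candidates
    return max(pool, key=lambda c: c.words)


# Invented sample data, for illustration only.
catalogue = [
    Corpus("Sample English Featured", "English", 1_000_000_000, featured=True),
    Corpus("Sample English Large", "English", 5_000_000_000),
    Corpus("Sample Czech Small", "Czech", 200_000_000),
    Corpus("Sample Czech Large", "Czech", 900_000_000),
]
```

With this sample data, `pick_corpus(catalogue, "English")` returns the featured corpus even though a larger one exists, while `pick_corpus(catalogue, "Czech")` falls back to the largest Czech corpus.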
TenTen corpora are excellent general-purpose corpora. Their main advantage is their large size, typically several billion words.
TenTen is a new generation of web corpora. They are created by crawling the web in a sophisticated way, and the downloaded texts undergo a complex process before they are included in the corpus. Non-text material, e.g. navigation menus, legal text or small print, is cleaned out, and duplicate text is removed. The texts are also evaluated, and those which are too short or contain too much content unsuitable for use in a corpus are removed. TenTen stands for 10^10 (10 billion) words. TenTen corpora in detail»
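To give a rough idea of what two of these cleaning steps involve, here is a minimal sketch that drops very short texts and exact duplicates. It is not the actual TenTen pipeline, which also removes non-text material and handles far subtler cases. The function name `clean_documents` and the `MIN_WORDS` threshold are assumptions made for this example.

```python
import hashlib

MIN_WORDS = 5  # hypothetical threshold; the real criteria are not stated here


def clean_documents(documents, min_words=MIN_WORDS):
    """Keep texts that are long enough and not seen before."""
    seen = set()
    kept = []
    for text in documents:
        # Normalise whitespace and case so trivial variants count as duplicates.
        normalised = " ".join(text.split()).lower()
        if len(normalised.split()) < min_words:
            continue  # too short to be useful
        digest = hashlib.sha1(normalised.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # duplicate of an earlier text
        seen.add(digest)
        kept.append(text)
    return kept


docs = [
    "Too short.",
    "This is a reasonably long sentence about corpora.",
    "This is a  reasonably long sentence about CORPORA.",
    "Another sufficiently long sentence appears right here.",
]
cleaned = clean_documents(docs)
```

Here `cleaned` keeps only the second and fourth texts: the first is too short and the third differs from the second only in spacing and capitalisation.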
The main advantage of timestamped corpora is their timestamps, i.e. information about when each text was published. This enables you to carry out diachronic analysis: finding trending words, neologisms or archaisms. Moreover, the size of the corpora (from hundreds of millions up to billions of words) also ensures coverage of less frequent words and expressions.
Timestamped corpora are created by crawling news articles from the web across the world. These news articles are detected by a system developed at the Jožef Stefan Institute in Slovenia. Currently, the Timestamped English corpus, with more than 28 billion words, is the biggest corpus in Sketch Engine. Timestamped corpora in detail»
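The idea behind diachronic analysis can be illustrated with a small sketch that compares relative word frequencies between two years. This is an assumption-laden toy example, not the method Sketch Engine uses: the functions `relative_frequencies` and `trending`, the `min_ratio` threshold and the sample texts are all invented for illustration.

```python
from collections import Counter, defaultdict


def relative_frequencies(dated_texts):
    """Map each year to {word: occurrences per million tokens}."""
    counts = defaultdict(Counter)
    for year, text in dated_texts:
        counts[year].update(text.lower().split())
    result = {}
    for year, counter in counts.items():
        total = sum(counter.values())
        result[year] = {w: n * 1_000_000 / total for w, n in counter.items()}
    return result


def trending(dated_texts, old_year, new_year, min_ratio=2.0):
    """Words that are new in new_year or at least min_ratio times more frequent."""
    freqs = relative_frequencies(dated_texts)
    old, new = freqs.get(old_year, {}), freqs.get(new_year, {})
    words = []
    for word, f_new in new.items():
        f_old = old.get(word, 0.0)
        if f_old == 0.0 or f_new / f_old >= min_ratio:
            words.append(word)
    return sorted(words)


# Invented timestamped texts: (year of publication, text).
texts = [
    (2019, "the cat sat on the mat"),
    (2020, "the selfie cat took the selfie"),
]
rising = trending(texts, 2019, 2020)
```

In this toy data, `rising` contains "selfie" and "took", the two words that appear only in the later year. Frequencies are normalised per million tokens so that years with different amounts of text can be compared.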
The size of the corpus
Sketch Engine contains hundreds of corpora of various sizes, from tiny (less than a million words) to really huge (10+ billion words). As a rule of thumb, a large corpus produces more and better data than a small one. See the comparison of the well-known British National Corpus (BNC) with other English corpora in Sketch Engine.
Most parallel corpora in Sketch Engine are multilingual corpora, i.e. they consist of the same text in many languages. Each language can also be used separately as a monolingual corpus.
Selecting a parallel corpus
You cannot select a parallel corpus as such. What you need to do is:
select the first language
go to a feature (e.g. concordance search, or bilingual word sketch)
when setting the criteria, select the second language
(in the case of the concordance search, you can even select more than one language)
OPUS corpora (recommended)
OPUS is a collection of translated texts from the web. It covers a wide selection of subjects and topics and is available in the largest number of languages, so it should be your first choice for parallel corpora. more on OPUS»
The EUROPARL corpus is created from the proceedings of the European Parliament and is available in 21 European languages. The nature of the corpus makes it a great resource for topics discussed in the European Parliament and for general formal language. Searching for language from topic areas which are rarely discussed in the European Parliament may not produce good results. more on EUROPARL»
EUR-Lex is a corpus created from translated documents of the European Union, available in the 24 official EU languages. It is recommended for general formal language and for subject areas covered in EU documents. Since EU documentation relates to many areas, it is suitable for general use too. more on EUR-Lex»