How to select a corpus

A corpus has to be selected before you can start using any of the Sketch Engine features. Here are a few tips for new users.

Option 1

  • click Select corpus in the left menu
  • type the language and select it – we will pick the best corpus for you
    OR
    type the corpus name and select it

The corpus dashboard will open, giving access to the tools and features.

Option 2

Use the corpus selector at the top:

  • type a few letters from the corpus name or the language
  • select the corpus

The corpus dashboard will open, giving access to the tools and features.

Which corpus to choose?

Featured corpora

Featured corpora are a good starting point for monolingual work. They were pre-selected based on their size, their quality and the number of features available for them.

No featured corpus?

If there is no featured corpus in your language, switch to All and use the search. Type a language or a corpus name.

TenTen corpora are excellent general-purpose corpora. Their main advantage is their large size, typically several billion words.

TenTen is a new generation of web corpora, created by crawling the web in a sophisticated way. The downloaded texts undergo a complex process before they are included in the corpus: non-text material (e.g. navigation menus, legal text or small print) is stripped out and duplicate text is removed. Texts which are too short, or which contain too much content unsuitable for use in a corpus, are also removed. TenTen stands for 10^10 (10 billion) words. TenTen corpora in detail»
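The cleaning steps described above can be pictured with a minimal sketch. It is only an illustration under stated assumptions, not the actual TenTen pipeline: the function names, the word-count threshold and the exact-match duplicate check are all hypothetical, and real web-corpus cleaning is considerably more sophisticated.

```python
import re

# Hypothetical threshold; the limits used by the real pipeline are not stated.
MIN_WORDS = 5


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def clean_texts(texts: list[str], min_words: int = MIN_WORDS) -> list[str]:
    """Keep texts that are long enough and not duplicates of an earlier text."""
    seen: set[str] = set()
    kept: list[str] = []
    for text in texts:
        key = normalize(text)
        if len(key.split()) < min_words:
            continue  # too short to be useful in a corpus
        if key in seen:
            continue  # duplicate of a text that was already kept
        seen.add(key)
        kept.append(text)
    return kept


downloaded = [
    "Home About Contact",
    "The committee approved the new budget after a long debate.",
    "The committee  approved the new budget after a long debate.",
    "Terms apply.",
]
cleaned = clean_texts(downloaded)
print(cleaned)
```

In this toy input, the menu-like line and the small print fall below the length threshold, and the second copy of the sentence is dropped as a duplicate.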

The main advantage of timestamped corpora is their timestamps: information about when each text was published. This enables diachronic analysis, such as finding trending words, neologisms or archaisms. Moreover, the size of the corpora (from hundreds of millions up to billions of words) also ensures coverage of less frequent words and expressions.

Timestamped corpora are created by crawling news articles from websites around the world. These news articles are detected by a system developed at the Jožef Stefan Institute in Slovenia. Currently, the Timestamped English corpus, with more than 28 billion words, is the biggest corpus in Sketch Engine. Timestamped corpora in detail»
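The kind of diachronic analysis that timestamps make possible can be illustrated with a small sketch. It is not how Sketch Engine computes trends; the function names, the per-million normalization and the ratio threshold are assumptions chosen for illustration, and the sample texts are made up.

```python
from collections import Counter


def relative_freq(texts: list[str]) -> dict[str, float]:
    """Return the frequency of each word per million tokens."""
    counts = Counter(word for text in texts for word in text.lower().split())
    total = sum(counts.values())
    return {word: count * 1_000_000 / total for word, count in counts.items()}


def trending(docs: list[tuple[int, str]], earlier_year: int, later_year: int,
             min_ratio: float = 2.0) -> list[str]:
    """List words that are new in the later year or at least min_ratio times more frequent."""
    earlier = relative_freq([text for year, text in docs if year == earlier_year])
    later = relative_freq([text for year, text in docs if year == later_year])
    result = []
    for word, freq in later.items():
        base = earlier.get(word)
        # A word unseen in the earlier period is treated as a candidate neologism.
        if base is None or freq / base >= min_ratio:
            result.append(word)
    return sorted(result)


# Hypothetical timestamped texts: (publication year, text)
docs = [
    (2019, "people work in the office"),
    (2019, "the office is open"),
    (2021, "people work from home"),
    (2021, "remote work from home is common"),
]
print(trending(docs, 2019, 2021))
```

Comparing relative rather than absolute frequencies matters because the amount of text collected can differ from one period to the next.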

The size of the corpus

Sketch Engine provides hundreds of corpora in various sizes, from tiny (less than a million words) to really huge (10+ billion words). Generally, exploring a language requires large corpora in order to reduce unwanted bias. See the comparison of the well-known British National Corpus (BNC) with other English corpora in Sketch Engine.

Parallel corpora

Most parallel corpora in Sketch Engine are multilingual corpora, i.e. they consist of the same text in many languages. Each language can also be used separately as a monolingual corpus.

Selecting a parallel corpus

You cannot select a parallel corpus as such. What you need to do is:

  • select the first language
  • go to a feature (e.g. concordance search or bilingual word sketch)
  • when setting the criteria, select the second language
    (in the case of the concordance search, you can even select more than one language)

OPUS is a collection of translated texts from the web. It covers a wide selection of subjects and topics and is available in the largest number of languages. This should be your first choice for parallel corpora. more on OPUS»

The EUROPARL corpus is created from the proceedings of the European Parliament and is available in 21 European languages. The nature of the corpus makes it a great resource for topics discussed in the European Parliament and for general formal language. Searches for language from topic areas that are rarely discussed in the European Parliament may not produce good results. more on EUROPARL»

EUR-Lex is a corpus created from translated documents of the European Union, available in the 24 official EU languages. It is recommended for general formal language and for subject areas covered in EU documents. Since EU documentation relates to many areas, it is suitable for general use too. more on EUR-Lex»

Display corpus information

After selecting a corpus, click the (i) info button next to the corpus name at the top centre of the screen.