Have Sketch Engine create your own subject-specific corpus

Could you not find the right corpus for your needs? Do you work with subject-specific language? The automatic corpus building tool in Sketch Engine will find relevant texts on the web, download them and process them into a corpus.

Learn to build corpora in Sketch Engine with this 5-minute video lesson.

Can other users access my data?

Sketch Engine is not a public cloud. Texts you upload will be stored in your personal space in your account. Other users cannot access your texts. You can, however, choose to grant access to other users.

How to use the corpus building tool

Log in, open the corpus selector at the top and click CREATE CORPUS.

Give your corpus a name, choose the language and, optionally, provide a description, then click NEXT.

Click Find texts on the web. You can also add your own data to the corpus, or build the corpus from your own data only, by clicking I have my own texts.

Select how texts from the web should be found:

  • Web search – type keywords and phrases that describe the topic
  • URLs – provide a list of web pages to download
  • Website – provide a website address to obtain up to 2000 text documents from the website
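If you intend to use the URLs option, it can help to tidy the list of web pages beforehand. The following is a minimal sketch and not part of Sketch Engine: the function name `clean_url_list` is hypothetical, and the decision to keep only http(s) addresses, drop fragments and remove duplicates is an assumption made for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def clean_url_list(lines):
    """Return the http(s) URLs found in `lines`, without fragments or
    duplicates, in their original order. (Hypothetical helper.)"""
    seen = set()
    cleaned = []
    for line in lines:
        url = line.strip()
        if not url:
            continue  # skip blank lines
        parts = urlsplit(url)
        # Assumption: only web pages served over http or https are wanted.
        if parts.scheme not in ("http", "https") or not parts.netloc:
            continue
        # Lowercase the host and drop the fragment (the part after '#').
        normalized = urlunsplit(
            (parts.scheme, parts.netloc.lower(), parts.path, parts.query, "")
        )
        if normalized not in seen:
            seen.add(normalized)
            cleaned.append(normalized)
    return cleaned


if __name__ == "__main__":
    raw = [
        "https://Example.com/a#top",
        "https://example.com/a",
        "",
        "  http://example.org/b?x=1  ",
    ]
    # One URL per line, ready to paste into a list of web pages.
    print("\n".join(clean_url_list(raw)))
```

The cleaned list can then be pasted in as the list of web pages to download.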

To use the web search option, type words and phrases, pressing ENTER after each one, and then click GO. Sketch Engine will use Bing to find relevant web pages and download them. Click NEXT when the download finishes.

More texts can be added at this point or any time later. Click COMPILE to process the data into a corpus.

Your corpus is now ready to use.

Click CORPUS DASHBOARD to start working with the corpus. EXTRACT KEYWORDS & TERMS reveals words that are typical of your corpus, so you can check that the topic coverage matches what you expected. CORPUS DETAILS AND STATISTICS gives word counts and other statistics about your corpus.
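To give a feel for what "typical of your corpus" can mean, the sketch below compares how often a word occurs in your corpus with how often it occurs in a general reference corpus. This page does not state how Sketch Engine scores keywords, so the formula, the function name `keyness` and the `smoothing` parameter are assumptions chosen for illustration only.

```python
def keyness(focus_count, focus_size, ref_count, ref_size, smoothing=1.0):
    """Ratio of per-million frequencies in two corpora. (Illustrative only.)

    focus_count / focus_size: occurrences of the word and total tokens
    in your own corpus; ref_count / ref_size: the same for a reference
    corpus. `smoothing` avoids division by zero when the word is absent
    from the reference corpus.
    """
    focus_per_million = focus_count / focus_size * 1_000_000
    ref_per_million = ref_count / ref_size * 1_000_000
    return (focus_per_million + smoothing) / (ref_per_million + smoothing)


if __name__ == "__main__":
    # Made-up counts: 50 hits per million tokens in your corpus versus
    # 10 hits per million tokens in the reference corpus.
    score = keyness(50, 1_000_000, 10, 1_000_000)
    print(round(score, 2))
```

A score well above 1 suggests the word is more frequent in your corpus than in general language. A score close to 1 suggests it is not specific to your topic.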

Corpora from files, URLs or translation memory

You can also create corpora from other sources:

  • files and documents uploaded to Sketch Engine
  • a user-defined list of web pages
  • the translation memory of your CAT tool

To learn more about user corpora, please refer to the User manual.