Have Sketch Engine create your own subject-specific corpus

Could you not find a suitable corpus? Do you work with subject-specific language? The automatic corpus building tool in Sketch Engine finds relevant texts on the web, downloads them and processes them into a corpus for you.

Learn to build a corpus in Sketch Engine with this 5-minute video lesson.

Can other users access my data?

Sketch Engine is not a public cloud. Texts you upload will be stored in your personal space in your account. Other users cannot access your texts. You can, however, choose to grant access to other users.

Can the Sketch Engine team use my data?

We (the Sketch Engine team) do not use or exploit the content of your corpora in any way, not even for improving statistical natural language processing methods. We do not provide any of your corpus data to anybody else.

How to build a corpus in detail

Log in, open the corpus selector at the top and click CREATE CORPUS.

build a corpus - create corpus

Give your corpus a name, choose the language, optionally add a description, and click NEXT.

build a corpus - name and language

Click Find texts on the web. You can also add your own data to the corpus, or build the corpus only from your own data, by clicking I have my own texts.

build a corpus - from the web or your own texts

Select how texts from the web should be found:

  • Web search – type keywords and phrases that describe the topic
  • URLs – provide a list of web pages to download
  • Website – provide a website address to obtain up to 10,000 text documents from the website

build a corpus - input type
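If you plan to use the URLs option, it can help to tidy the list before pasting it in. The following is a minimal sketch in Python, not part of Sketch Engine: the function name `clean_url_list` and the rule of keeping only http and https addresses are assumptions made for illustration.

```python
from urllib.parse import urlsplit


def clean_url_list(lines):
    """Return unique http(s) URLs from `lines`, keeping their original order.

    Hypothetical helper for preparing a list to paste into the URLs option;
    it does not talk to Sketch Engine in any way.
    """
    seen = set()
    cleaned = []
    for line in lines:
        url = line.strip()
        if not url:
            continue  # skip blank lines
        try:
            parts = urlsplit(url)
        except ValueError:
            continue  # skip strings that cannot be parsed as a URL
        if parts.scheme not in ("http", "https") or not parts.netloc:
            continue  # assumption: only web pages are wanted
        if url not in seen:
            seen.add(url)
            cleaned.append(url)
    return cleaned


if __name__ == "__main__":
    raw = [
        "https://example.com/solar/intro",
        "  https://example.com/solar/intro  ",
        "",
        "ftp://example.com/archive.zip",
        "http://example.org/inverters",
    ]
    for url in clean_url_list(raw):
        print(url)
```

The example addresses are placeholders. Removing duplicates and non-web entries first simply makes it easier to see how many pages you are asking to be downloaded.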

To use the web search option, type words and phrases, pressing ENTER after each one, and then click GO. Sketch Engine will interact with Bing to find relevant web pages and download them. Click NEXT when the download finishes.
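To get a feel for why several keywords and phrases work better than one, the sketch below combines seed words into small groups, each of which could serve as one search query. This is an illustration only: how Sketch Engine actually builds its queries is not described here, and the function name `build_queries` and its parameters `tuple_size`, `max_queries` and `rng_seed` are hypothetical.

```python
import random
from itertools import combinations


def build_queries(seeds, tuple_size=3, max_queries=10, rng_seed=0):
    """Combine seed keywords into search-query strings.

    Assumption for illustration: each query is a random combination of
    `tuple_size` distinct seeds; multi-word phrases are wrapped in quotes.
    """
    # Drop empty entries and duplicates while keeping the original order.
    unique = list(dict.fromkeys(s.strip() for s in seeds if s.strip()))
    size = min(tuple_size, len(unique))
    if size == 0:
        return []
    combos = list(combinations(unique, size))
    random.Random(rng_seed).shuffle(combos)  # seeded, so runs are repeatable
    queries = []
    for combo in combos[:max_queries]:
        terms = [f'"{s}"' if " " in s else s for s in combo]
        queries.append(" ".join(terms))
    return queries


if __name__ == "__main__":
    seeds = ["solar panel", "inverter", "photovoltaic", "grid connection"]
    for query in build_queries(seeds, tuple_size=3, max_queries=4):
        print(query)
```

Combining several topic words in one query tends to narrow results to pages that are actually about the subject, which is why it pays to enter more than a handful of keywords.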

More texts can be added at this point or any time later. Click COMPILE to process the data into a corpus.

Your corpus is ready to use now.

Click CORPUS DASHBOARD to start working with the corpus. EXTRACT KEYWORDS & TERMS will reveal words which are typical for your corpus, so you can check that the topic coverage corresponds to what you expected. CORPUS DETAILS AND STATISTICS gives word counts and other statistics about your corpus.
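To illustrate what "typical for your corpus" can mean, here is a minimal sketch of one common keyness measure: the ratio of a word's frequency per million in your corpus to its frequency per million in a reference corpus, with a smoothing constant added to both. It is a sketch under stated assumptions rather than a description of what the tool computes internally; the function name `keyness`, the default smoothing value of 1 and the whitespace tokenization are all chosen for illustration.

```python
from collections import Counter


def keyness(focus_tokens, ref_tokens, smoothing=1.0, top=5):
    """Rank words by (fpm_focus + smoothing) / (fpm_ref + smoothing).

    fpm = frequency per million tokens. Only words that occur in the
    focus corpus are scored. Illustrative helper, not a Sketch Engine API.
    """
    focus = Counter(focus_tokens)
    ref = Counter(ref_tokens)
    focus_total = sum(focus.values())
    ref_total = sum(ref.values())
    if focus_total == 0 or ref_total == 0:
        raise ValueError("both corpora must contain at least one token")

    scores = {}
    for word, count in focus.items():
        fpm_focus = count * 1_000_000 / focus_total
        fpm_ref = ref[word] * 1_000_000 / ref_total  # 0 if the word is absent
        scores[word] = (fpm_focus + smoothing) / (fpm_ref + smoothing)
    # Highest score first: words most characteristic of the focus corpus.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top]


if __name__ == "__main__":
    focus = "the inverter converts the panel output the inverter".split()
    reference = "the cat sat on the mat and the dog ran".split()
    for word, score in keyness(focus, reference):
        print(f"{word}\t{score:.2f}")
```

Words that are frequent everywhere, such as "the", end up with a score close to 1, while words concentrated in your corpus score much higher. If the top of the list does not match your topic, the corpus probably contains off-topic texts.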

Inserting search keywords automatically using term extraction

To make the corpus bigger, you can use the built-in keyword and term extraction in Sketch Engine to suggest candidate search keywords. Once the first version of your corpus is compiled, you will see a link that suggests keywords automatically based on the current content of the corpus:

Adding search keywords automatically using the term extraction.

Build a corpus from files, URLs or translation memory

You can also create corpora from other sources:

  • files and documents uploaded to Sketch Engine
  • a user-defined list of web pages
  • the translation memory of your CAT tool

To learn more about user corpora, please refer to the User manual.