Have Sketch Engine create your own subject-specific corpus
Could you not find a corpus that suits your needs? Do you work with subject-specific language? The automatic corpus building tool in Sketch Engine will find relevant texts on the web, download them and process them into a corpus.
Learn to build corpora in Sketch Engine with this 5-minute video lesson.
How to use the corpus building tool
Log in, open the corpus selector at the top and click CREATE CORPUS.
Give your corpus a name, choose the language, optionally provide a description, and click NEXT.
Click Find texts on the web. You can also add your own data to the corpus, or build the corpus only from your own data, by clicking I have my own texts.
Select how texts from the web should be found:
- Web search – type keywords and phrases that describe the topic
- URLs – provide a list of web pages to download
- Website – provide a website address to obtain up to 2000 text documents from the website
To use the web search option, type words and phrases, hitting ENTER after each one, then click GO. Sketch Engine will use Bing to find relevant web pages and download them. Click NEXT when the download finishes.
More texts can be added at this point or any time later. Click COMPILE to process the data into a corpus.
Your corpus is ready to use now.
Click CORPUS DASHBOARD to start working with the corpus. EXTRACT KEYWORDS & TERMS will reveal words which are typical of your corpus, so you can check that the topic coverage corresponds to what you expected. CORPUS DETAILS AND STATISTICS gives word counts and other statistics about your corpus.
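To get a feel for how the keywords and phrases typed into the web search option can drive a search, here is a minimal Python sketch. It assumes the seed words are combined into small random tuples, each sent as one query, which is the approach used by BootCaT-style tools. Whether Sketch Engine builds its Bing queries in exactly this way is an assumption, and the function name build_queries and its parameters are hypothetical.

```python
import itertools
import random


def build_queries(seeds, tuple_size=3, max_queries=10, rng_seed=0):
    """Combine seed words and phrases into search query strings.

    Hypothetical helper: illustrates tuple-based querying only,
    not Sketch Engine's actual implementation.
    """
    # Every unordered combination of `tuple_size` distinct seeds.
    combos = list(itertools.combinations(seeds, tuple_size))
    # Shuffle with a fixed seed so the selection is reproducible.
    rng = random.Random(rng_seed)
    rng.shuffle(combos)
    queries = []
    for combo in combos[:max_queries]:
        # Quote multi-word phrases so a search engine treats them as a unit.
        parts = [f'"{s}"' if " " in s else s for s in combo]
        queries.append(" ".join(parts))
    return queries


seeds = [
    "corpus linguistics",
    "concordance",
    "collocation",
    "lemmatization",
    "keyword extraction",
]
queries = build_queries(seeds)
for query in queries:
    print(query)
```

Because each query mixes several seeds, the pages it returns tend to be about the topic as a whole rather than about one ambiguous word, which is why a handful of well-chosen phrases is usually enough to start with.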
Inserting search keywords automatically using term extraction
You can use the built-in keyword and term extraction in Sketch Engine to obtain candidate search keywords for making the corpus bigger. Once the first version of your corpus is compiled, you will see a link for suggesting keywords automatically based on the current content of the corpus.
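The idea behind extracting candidate keywords can be sketched in a few lines of Python. The example below assumes a smoothed ratio of per-million frequencies between your corpus and a general reference corpus, a common keyness measure. The scoring Sketch Engine actually applies may differ, and the function name keyword_candidates, the smoothing parameter and the toy texts are all illustrative assumptions.

```python
from collections import Counter


def keyword_candidates(focus_tokens, reference_tokens, smoothing=1.0, top_n=5):
    """Rank words that are typical of the focus corpus.

    Hypothetical helper: scores each word by a smoothed ratio of
    per-million frequencies (focus corpus vs. reference corpus).
    """
    focus = Counter(focus_tokens)
    reference = Counter(reference_tokens)
    focus_total = sum(focus.values())
    reference_total = sum(reference.values())

    def per_million(count, total):
        # Normalize so corpora of different sizes are comparable.
        return count * 1_000_000 / total if total else 0.0

    scores = {
        word: (per_million(count, focus_total) + smoothing)
        / (per_million(reference[word], reference_total) + smoothing)
        for word, count in focus.items()
    }
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]


focus_text = "the corpus contains the concordance of the corpus".split()
reference_text = "the cat sat on the mat of the house".split()
candidates = keyword_candidates(focus_text, reference_text, top_n=3)
print(candidates)
```

Words that are frequent everywhere, such as "the", score close to 1 and drop out, while words concentrated in your corpus rise to the top and make good seeds for the next round of web search.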
Corpora from files, URLs or translation memory
You can also create corpora from other sources:
- files and documents uploaded to Sketch Engine
- a user-defined list of web pages
- the translation memory of your CAT tool
To learn more about user corpora, please refer to the User manual.