Have Sketch Engine create your own subject-specific corpora
Could not find the right corpus? Do you work with subject-specific language? WebBootCaT is a simple, intuitive tool that creates a user corpus by automatically downloading relevant texts from the internet.
After logging in, click WebBootCaT.
WebBootCaT – create your own corpus from the web
The following example uses the first option offered in step 3, seed words.
(1) Name the corpus
(2) Select the language
(3) Choose how you want to define the topic of the corpus:
- seed words – type keywords and phrases that describe the topic
- URLs – provide a list of web pages to download
- Website – type a website address to obtain up to 2000 text documents from that site
(4) Type the seed words. The list does not have to be exhaustive: you can repeat the procedure later with different words to harvest more texts.
(5) Click Next >.
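To give a feel for how a handful of seed words can lead to a list of candidate pages, here is a minimal sketch of the general BootCaT idea of combining seed words into small tuples, each of which would be sent to a search engine as one query. It is an illustration only, not Sketch Engine's actual code: the function name `build_queries` and the parameters `tuple_size`, `max_queries` and `rng_seed` are hypothetical, and no search is performed.

```python
import itertools
import random


def build_queries(seed_words, tuple_size=3, max_queries=10, rng_seed=0):
    """Combine seed words into search query strings (hypothetical helper)."""
    # Drop blank entries and repeated seeds while keeping the original order.
    seeds = list(dict.fromkeys(w.strip() for w in seed_words if w.strip()))
    if not seeds:
        return []
    if len(seeds) < tuple_size:
        # Too few seeds to form a full tuple: fall back to a single query.
        return [" ".join(seeds)]

    # Every unordered combination of `tuple_size` seeds is a candidate query.
    combos = list(itertools.combinations(seeds, tuple_size))
    # Shuffle with a fixed seed so the selection is random but reproducible.
    random.Random(rng_seed).shuffle(combos)

    queries = []
    for combo in combos[:max_queries]:
        # Quote multi-word seeds so they would be searched as phrases.
        parts = [f'"{w}"' if " " in w else w for w in combo]
        queries.append(" ".join(parts))
    return queries


if __name__ == "__main__":
    for query in build_queries(["corpus", "word sketch", "concordance", "lemma"]):
        print(query)
```

With four seed words and tuples of three, there are only four possible combinations, which is why a short seed list can still produce several distinct queries.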
Sketch Engine will find relevant web pages and display them in a list. To exclude a page, remove its tick. Then click Next >.
WebBootCaT – URL suggestion for corpus from the web
Sketch Engine will start downloading the texts from the web pages and will also process the texts for use in Sketch Engine. With large numbers of pages, the process can take several minutes to complete. The process is over when the progress bar reaches 100%. Texts which are too short or have other issues will be excluded.
Relevant texts are downloaded, tagged, and checked for duplicates and for text unsuitable for inclusion in the corpus.
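The kind of clean-up described here can be pictured with a small sketch that drops very short texts and exact duplicates. The criteria Sketch Engine really applies are not stated on this page, so the 50-word threshold, the exact-match comparison and the function name `clean_documents` are assumptions made purely for illustration.

```python
import hashlib
import re


def clean_documents(texts, min_words=50):
    """Keep texts that are long enough and not exact duplicates (hypothetical helper)."""
    kept = []
    seen = set()
    for text in texts:
        # Normalize whitespace and case so trivially different copies match.
        normalized = re.sub(r"\s+", " ", text).strip().lower()
        # Assumed rule: skip texts shorter than `min_words` words.
        if len(normalized.split()) < min_words:
            continue
        # Fingerprint the normalized text to detect exact duplicates.
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(text)
    return kept


if __name__ == "__main__":
    docs = ["too short", "word " * 60, "Word  " * 60, "other " * 60]
    print(len(clean_documents(docs)))
```

A real pipeline would likely also catch near-duplicates and boilerplate such as navigation menus; this sketch only covers the simplest case.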
Your corpus is now ready to use. Click Home in the main menu, go to My corpora and select the newly created corpus. Use the main menu to generate word sketches and thesaurus entries, and to work with the corpus as normal.