Build a corpus from the web
Sketch Engine also serves as corpus building software. It has a unique corpus-building tool, which uses the WebBootCaT technology, to automatically create a text corpus from relevant web pages. Data downloaded from the internet are cleaned and optionally deduplicated, and non-text content is removed, to obtain linguistically valuable text material. The user can specify which content should be downloaded via one of these options:
- by providing some typical words defining the topic (seed words)
(relevant Wikipedia article(s) can be used for seed word suggestions)
- by providing a list of URLs which should be downloaded
- by downloading a complete website
The user can also upload files to build a corpus.
How to create a corpus from the web
There are 3 ways to reach the corpus building tool:
- on the corpus dashboard, click NEW CORPUS
- on the select corpus advanced screen, click NEW CORPUS
- open the corpus selector at the top of each screen and click CREATE CORPUS
In the corpus building interface
- type a name for your new corpus, select the language, optionally provide a description and click NEXT
- select the Find texts on the web option
- click on the help icons to learn about the options and settings
This process can be repeated to make the corpus larger. Building from the web can be combined with uploading files to the corpus.
How do I decide on the correct parameters?
As a rule of thumb, do not worry about the advanced settings and use the defaults. Only start looking into the advanced settings if the defaults do not produce the desired results.
You can repeat the same procedure several times to enlarge the corpus. Sketch Engine will make sure no page is included twice.
The allowlist keywords can be useful to avoid ambiguity of the seed words, i.e. you can make some of the unambiguous seed words compulsory to make sure the document matches the topic.
Denylist keywords can also be used to reduce ambiguity (e.g. you might add “politics” to the denylist when collecting a corpus on the environment with the seed word “party”). The denylist and allowlist are only needed if irrelevant documents are found.
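The effect of the two keyword lists can be illustrated with a minimal Python sketch. It is not Sketch Engine's implementation: the function names and the exact matching rule (every allowlist keyword must occur, no denylist keyword may occur) are assumptions made for illustration, and the real tool may apply different thresholds.

```python
import re


def tokenize(text):
    """Lowercase the text and return the set of word tokens it contains."""
    return set(re.findall(r"\w+", text.lower()))


def keep_document(text, allowlist=(), denylist=()):
    """Return True if the document passes both keyword lists.

    Assumed rule (hypothetical, for illustration only): all allowlist
    keywords must be present and no denylist keyword may be present.
    """
    words = tokenize(text)
    has_required = all(k.lower() in words for k in allowlist)
    has_banned = any(k.lower() in words for k in denylist)
    return has_required and not has_banned


docs = [
    "The party leader spoke about politics and the election.",
    "A beach party can harm the local ecosystem and wildlife.",
]
# Keep only documents that mention "ecosystem" and do not mention "politics".
kept = [d for d in docs if keep_document(d, ["ecosystem"], ["politics"])]
```

Here the ambiguous seed word “party” matches both documents, but only the second one survives the filter.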
How to create a 10-million-word corpus?
You can run the corpus building tool many times to build a bigger corpus. You should aim for 20-60 seeds if that is possible in your domain. Furthermore, you can repeat the process with the same seeds multiple times (most likely, different seed groups will be used each time). It is also possible to split your seeds into sets of 10 and run the tool with each set. Please note that you can use multiword expressions such as “kick the bucket” by enclosing them in quotes, and also proper names of different kinds.
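The idea of seed groups and seed sets can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the tool's actual algorithm: the group size of 3, the number of groups and all function names are hypothetical.

```python
import math
import random


def split_seeds(seeds, size=10):
    """Split a seed list into consecutive sets of at most `size` seeds."""
    return [seeds[i:i + size] for i in range(0, len(seeds), size)]


def seed_groups(seeds, group_size=3, n_groups=10, rng=None):
    """Draw distinct random groups of seeds; each group stands for one query.

    group_size and n_groups are illustrative values, not the tool's defaults.
    """
    rng = rng or random.Random()
    unique = sorted(set(seeds))
    # Never ask for more distinct groups than can exist.
    limit = min(n_groups, math.comb(len(unique), group_size))
    groups = set()
    while len(groups) < limit:
        groups.add(tuple(sorted(rng.sample(unique, group_size))))
    return sorted(groups)


def to_query(group):
    """Join a group into a query string, quoting multiword seeds."""
    return " ".join(f'"{s}"' if " " in s else s for s in group)
```

Because the groups are drawn at random, running the same seeds again will most likely produce different queries, which is why repeating the process can bring in new pages.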
How to limit my corpus to British English or European Portuguese only?
Limit the search to UK domains or to the domains of Portugal only: type .uk (or .pt for Portugal) into the site list in the advanced options. Refer to the corresponding help icon in the interface.
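What a site list entry such as .uk or .pt restricts can be shown with a small Python sketch. It only illustrates matching a host name against a domain suffix; the function name is hypothetical and the way Sketch Engine applies the site list internally is not described here.

```python
from urllib.parse import urlsplit


def in_site_list(url, suffixes):
    """Return True if the URL's host falls under one of the domain suffixes."""
    host = (urlsplit(url).hostname or "").lower()
    for suffix in suffixes:
        bare = suffix.lstrip(".").lower()
        # Match the domain itself or any host that ends in ".<suffix>".
        if host == bare or host.endswith("." + bare):
            return True
    return False


urls = [
    "https://www.bbc.co.uk/news",
    "https://www.example.com/uk/page",
    "https://www.publico.pt/",
]
uk_only = [u for u in urls if in_site_list(u, [".uk"])]
```

Note that the check looks at the host name only, so a path such as /uk/ on a .com site does not count as a UK domain.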
How do I get new seed words when I want to repeat the process?
To repeat the process with new seed words, use the keyword extraction from the current corpus.
- go to Manage corpus and click Make bigger
- select Find texts on the web
- click SUGGESTIONS
- tick the keywords you want to use as new seed words
The terms previously used as seed words are highlighted.
You can repeat the process as many times as you like. You can see how much data you have at each stage by checking the corpus page.
Why are some paragraphs missing?
The web building tool uses the jusText tool to remove unwanted content (boilerplate) such as page navigation, headers, footers and very short paragraphs. Distinguishing low-quality text from good-quality text is very difficult to do programmatically. This is why, on very rare occasions, some good content may be removed by mistake.
A tip for downloading pages with little text on them: set Min file size and Min cleaned file size to zero in the advanced options. The tool is still likely to ignore short isolated paragraphs, which are common in some online forums and discussions.
Refer to the help icon next to the options in the Expert settings.
A corpus can be created from the web even if the language is not supported by Sketch Engine. Select “–other (UTF-8)–” from the language dropdown if your language is not listed. In that case:
- only the universal tokenizer can be applied (or use your own tokenizer prior to uploading data),
- no automated taggers can be applied (or use your own tagger prior to uploading data),
- automatic encoding detection might be limited – uploading files in UTF-8 is recommended,
- the search engine settings will not restrict the search to any language when using WebBootCaT.
The Word sketch feature and related functions depend on a sketch grammar defined by the user; alternatively, you can select the universal generic sketch grammar.
When creating a corpus or adding new texts to an existing corpus using the built-in web corpus building tool, a simple strategy is applied to avoid duplicated content:
- Sketch Engine will not download the same URL twice into the same corpus
- if exactly the same content (an exact copy of the same document) is found on a different URL, it will not be downloaded again
Optionally, the user can apply a sophisticated deduplication tool during compilation. It can be applied while the corpus is built for the first time or at any time later. It is powered by the onion deduplication tool. Refer to the help icons in the Expert options on the compilation screen.
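The two simple checks in the list above (same URL, exact copy of the same content) can be sketched in Python as follows. This is only an illustration: the class name is hypothetical, and comparing SHA-256 hashes of the text is an assumed way of detecting exact copies, not a description of how Sketch Engine does it. Near-duplicate detection as performed by onion is not covered.

```python
import hashlib


class SimpleDeduplicator:
    """Reject a page if its URL or its exact content was already seen."""

    def __init__(self):
        self.seen_urls = set()
        self.seen_hashes = set()

    def accept(self, url, content):
        """Return True if the page is new and should be added to the corpus."""
        if url in self.seen_urls:
            return False  # same URL already processed for this corpus
        self.seen_urls.add(url)
        # Assumption: exact copies are detected by hashing the full text.
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest in self.seen_hashes:
            return False  # identical document found under a different URL
        self.seen_hashes.add(digest)
        return True


dedup = SimpleDeduplicator()
results = [
    dedup.accept("https://example.org/a", "Some article text."),
    dedup.accept("https://example.org/a", "Changed text."),
    dedup.accept("https://example.org/b", "Some article text."),
    dedup.accept("https://example.org/c", "Another article."),
]
```

In this run only the first and the last page are accepted: the second repeats a URL and the third repeats the content of the first.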
I cannot download a specific website.
The Internet contains an enormous number of formats, standards, protocols and settings. Although we try hard to accommodate all of them, we cannot guarantee a particular website or page will be downloaded.
Some websites explicitly allow only certain engines (e.g. only Google) to download their content. We must respect those settings to avoid being put on a denylist.
With the default settings, a page will not be downloaded if it does not contain a linguistically valuable body of text, i.e. if it contains either very little text or lots of text divided into many short unrelated sections, such as this page.
Sketch Engine cannot download password-protected web pages.
The exact reason why the page was not included in the corpus can be found in the log file (MANAGE CORPUS – LOGS, the log name contains bootcat_and_compile.log).
Forums, discussions and other text sparse pages
Set Min file size and Min cleaned file size to zero in the advanced options. Very short isolated paragraphs may still be ignored because they might be incorrectly identified as navigation menus or similar linguistically unsuitable content. If necessary, use the Save as option in your browser and upload the saved page to Sketch Engine manually.
Alternative download tools
Tools such as HTTrack, cURL or Wget might be able to download the problematic pages. These tools can also help with password-protected web pages. Bear in mind possible legal implications when using these tools to download internet content.
For more information on WBC, please see WebBootCaT: a web tool for instant corpora (2006).
- Marco Baroni, Adam Kilgarriff, Jan Pomikálek and Pavel Rychlý (2006)
- In Proceedings of EAMT. 11th Annual Conference of the European Association for Machine Translation. Oslo, Norway, pp. 247–252