Build a corpus from the web

Sketch Engine also serves as corpus-building software. It has a unique corpus-building tool, which uses the WebBootCaT technology to automatically create a text corpus from relevant web pages. Data downloaded from the internet are cleaned, non-text content is removed and the result is optionally deduplicated to obtain linguistically valuable text material. The user can specify which content should be downloaded via one of these options:

  • by providing some typical words defining the topic (seed words)
    (relevant Wikipedia article(s) can be used for seed word suggestions)
  • by providing a list of URLs which should be downloaded
  • by downloading a complete website

The user can also upload files to build a corpus from.

Who can access my data?

Sketch Engine is not a public cloud. Texts you upload and corpora you create will be stored in your personal space in your account. Other users cannot access your texts. You can, however, choose to grant access to other users. An explicit action has to be taken by the user for this to happen.

How to create a corpus from the web

There are 3 ways to reach the corpus building tool:

  • on the corpus dashboard, click NEW CORPUS
  • on the select corpus advanced screen, click NEW CORPUS
  • open the corpus selector at the top of each screen and click CREATE CORPUS

In the corpus building interface

  • type a name for your new corpus, select the language, optionally provide a description and click NEXT
  • select the Find texts on the web option
  • click on the help icons to learn about the options and settings

This process can be repeated to make the corpus larger. Building from the web can be combined with uploading files to the corpus.

FAQs

As a rule of thumb, do not worry about the advanced settings and use the defaults. Start looking into the advanced settings only if the defaults do not produce the desired results.

You can repeat the same procedure several times to enlarge the corpus. Sketch Engine will make sure no page is included twice.

The allowlist keywords can be useful to avoid ambiguity of the seed words, i.e. you can make some of the unambiguous seed words compulsory to make sure the document matches the topic.

Denylist keywords can also be used to reduce ambiguity (e.g. when collecting a corpus on the environment using the seed word “party”, you might add “politics” to the denylist). The denylist and allowlist are only necessary if irrelevant documents are found.
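The effect of the two lists can be pictured with a small filter. This is only a sketch of the idea, not Sketch Engine's implementation: the function name `matches_topic` and the rule it applies (at least one allowlist keyword, no denylist keyword, whole-word matching of single words) are assumptions made for illustration.

```python
import re

def matches_topic(text, allowlist, denylist):
    """Hypothetical filter: keep a document only if it contains at least
    one allowlist keyword and none of the denylist keywords."""
    words = set(re.findall(r"\w+", text.lower()))
    has_allowed = any(w.lower() in words for w in allowlist)
    has_denied = any(w.lower() in words for w in denylist)
    return has_allowed and not has_denied

docs = [
    "The party pledged to cut emissions and protect the climate.",
    "Party politics dominated the debate on emissions.",
    "We threw a birthday party last weekend.",
]
# Only documents mentioning "emissions" and not mentioning "politics" are kept.
kept = [d for d in docs if matches_topic(d, ["emissions"], ["politics"])]
```

Here only the first document survives: the second is rejected by the denylist keyword and the third lacks the allowlist keyword.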

You can run WebBootCaT many times to build a bigger corpus. You should aim for 20-60 seeds if that is possible in your domain. You can repeat the process with the same seeds multiple times (there is only a very small probability that the same seed tuples will be chosen). You can also split your seeds into sets of 10 seeds and run the tool with each seed set. Please note that you can use multiword expressions such as “kick the bucket” by enclosing them in quotes, and also proper names of different kinds.
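The idea of seed sets and random seed tuples can be sketched as follows. The function names and the tuple size of 3 are assumptions chosen for illustration; they are not documented Sketch Engine values.

```python
import math
import random

def split_seeds(seeds, size=10):
    """Split the seed list into consecutive sets of at most `size` seeds."""
    return [seeds[i:i + size] for i in range(0, len(seeds), size)]

def random_tuples(seeds, n_tuples, tuple_size=3, rng=None):
    """Draw n_tuples random combinations of tuple_size distinct seeds.
    tuple_size=3 is an assumed value, not a documented default."""
    rng = rng or random.Random()
    return [tuple(rng.sample(seeds, tuple_size)) for _ in range(n_tuples)]

seeds = [f"seed{i}" for i in range(1, 26)]    # 25 placeholder seeds
sets = split_seeds(seeds)                     # three sets: 10, 10 and 5 seeds
tuples = random_tuples(seeds, n_tuples=5, rng=random.Random(0))
n_combinations = math.comb(len(seeds), 3)     # number of possible 3-seed tuples
```

Under these assumptions, 25 seeds already give 2,300 possible tuples of three, which illustrates why repeated runs with the same seeds rarely pick the same tuples.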

To limit the search to UK domains only (or to the domains of Portugal), type .uk (or .pt) into the site list in the advanced options.

To repeat the process with new seed words, use the keyword extraction from the current corpus.

  • click Home
  • locate your corpus and click the wrench button (manage your corpus)
  • in the left menu, click on Keywords and terms; the process will start automatically
  • tick the keywords you want to use as new seed words
  • click Use WebBootCaT with selected words
  • you will need to name this part of your corpus and then proceed as normal.

You can repeat the process as much as you like. You can see how much data you have at each stage by checking the corpus page.

The web building tool uses the jusText tool to remove unwanted content (boilerplate) such as page navigation, headers, footers and very short paragraphs. Distinguishing low-quality text from good-quality text is very difficult to do programmatically. This is why, on very rare occasions, some good content may be removed by mistake.

A tip for downloading pages with little text on them: set Min file size and Min cleaned file size to zero in the advanced options. The tool is still likely to ignore short isolated paragraphs, which can be the case with some discussions.

A corpus can be created from the web even if the language is not supported by Sketch Engine. Select “–other (UTF-8)–” from the language dropdown if your language is not listed. The following limitations then apply:

  • only the universal tokenizer can be applied (or use your own tokenizer prior to uploading data),
  • no automated taggers can be applied (or use your own tagger prior to uploading data),
  • automatic encoding detection might be limited, so uploading files in UTF-8 is recommended,
  • the search engine setting will not constrain the search to any language when using WebBootCaT.

The Word sketch feature and related functions depend on a sketch grammar: you can define your own or select the universal generic sketch grammar.


When creating a corpus or adding new texts to an existing corpus using the built-in web corpus building tool, a simple strategy is applied to avoid duplicated content:

  • Sketch Engine will not download the same URL twice into the same corpus
  • if exactly the same content (an exact copy of the same document) is found on a different URL, it will not be downloaded again
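The two rules above can be sketched in a few lines. This is an illustrative model only: the class name `SeenPages` and the use of a SHA-256 hash to detect exact copies are assumptions, not a description of how Sketch Engine implements the check.

```python
import hashlib

class SeenPages:
    """Hypothetical tracker for one corpus: remembers URLs and content hashes."""

    def __init__(self):
        self.urls = set()
        self.hashes = set()

    def should_add(self, url, content):
        """Return True only for a new URL whose content is not an exact copy."""
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if url in self.urls or digest in self.hashes:
            return False
        self.urls.add(url)
        self.hashes.add(digest)
        return True

seen = SeenPages()
results = [
    seen.should_add("https://example.com/a", "Some text."),
    seen.should_add("https://example.com/a", "Updated text."),   # same URL
    seen.should_add("https://example.org/copy", "Some text."),   # same content
    seen.should_add("https://example.org/b", "Other text."),
]
```

A hash comparison of this kind only catches exact copies; near-duplicates need a dedicated deduplication step.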

Optionally, the user can apply a sophisticated deduplication tool during compilation. It can be applied when the corpus is built for the first time or at any time later. It is powered by the onion deduplication tool.

The Internet contains an enormous number of formats, standards, protocols and settings. Although we try hard to accommodate all of them, we cannot guarantee a particular website or page will be downloaded.

Some websites explicitly allow only certain engines (e.g. only Google) to download their content. We must respect those settings to avoid being placed on a denylist.
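One common way for a website to express such a restriction is a robots.txt file; whether that is the mechanism used by any particular site is an assumption here. The sketch below uses Python's standard `urllib.robotparser` with an invented robots.txt and a made-up crawler name (`ExampleCorpusBot`) to show how a crawler could check whether it is allowed to fetch a page.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt that admits only Googlebot and blocks everyone else.
robots_txt = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

url = "https://example.com/articles/page.html"
google_allowed = parser.can_fetch("Googlebot", url)
other_allowed = parser.can_fetch("ExampleCorpusBot", url)  # made-up crawler name
```

With these rules, `google_allowed` is true and `other_allowed` is false, so a well-behaved crawler other than Googlebot would skip the page.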

With the default settings, a page will not be downloaded if it does not contain a linguistically valuable body of text, i.e. if it contains either very little text or a lot of text divided into many short unrelated sections, such as this page.

Sketch Engine cannot download password-protected web pages.

Solving problems

Log file
The exact reason why a page was not included in the corpus can be found in the log file (MANAGE CORPUS – LOGS, the log name contains bootcat_and_compile.log).

Forums, discussions and other text-sparse pages
Set Min file size and Min cleaned file size to zero in the advanced options. Very short isolated paragraphs may still be ignored because they might be incorrectly identified as navigation menus or similar linguistically unsuitable content. If necessary, use the Save as option in your browser and upload the saved page to Sketch Engine manually.

Alternative download tools
Tools such as HTTrack, cURL or Wget might be able to download the problematic pages. These tools can also help with password-protected web pages. Bear in mind possible legal implications when using these tools to download internet content.

For more information on WebBootCaT, please see WebBootCaT: a web tool for instant corpora (2006).

WebBootCaT: instant domain-specific corpora to support human translators

  • Marco Baroni, Adam Kilgarriff, Jan Pomikálek and Pavel Rychlý (2006)
  • In Proceedings of EAMT. 11th Annual Conference of the European Association for Machine Translation. Oslo, Norway, pp. 247–252