Build a corpus from the web

Sketch Engine also serves as corpus-building software. Its corpus-building tool uses the WebBootCaT technology to automatically create a text corpus from relevant web pages. Data downloaded from the internet are cleaned, non-text content is removed and the text is optionally deduplicated to obtain linguistically valuable text material. The user can specify which content should be downloaded via one of these options:

  • by providing some typical words defining the topic (seed words)
    (relevant Wikipedia article(s) can be used for seed word suggestions)
  • by providing a list of URLs which should be downloaded
  • by downloading a complete website

Who can access my data?

Sketch Engine is not a public cloud. Texts you upload and corpora you create will be stored in your personal space in your account. Other users cannot access your texts. You can, however, choose to grant access to other users. An explicit action has to be taken by the user for this to happen.

How to create a corpus from the web

There are 3 ways to reach the corpus building tool:

  • on the corpus dashboard click NEW CORPUS
  • on the select corpus advanced screen click NEW CORPUS
  • open the corpus selector at the top of each screen and click CREATE CORPUS

In the corpus building interface

  • type a name for your new corpus, select the language, optionally provide a description and click NEXT
  • select the Find texts on the web option
  • click on the help icons to learn about the options and settings

This process can be repeated to make the corpus larger. Building from the web can be combined with uploading files to the corpus.

FAQs

How do I decide on the correct parameters?

As a rule of thumb, do not worry about the advanced settings and use the defaults. Start looking into the advanced settings only if the defaults do not produce the results you need.

You can repeat the same procedure several times to enlarge the corpus. Sketch Engine will make sure no page, text or part of text is included twice (deduplication).

Whitelist keywords can be useful to avoid ambiguity of the seed words, i.e. you can make some of the unambiguous seed words compulsory to make sure the document matches the topic.

Blacklist keywords can also be used to reduce ambiguity (e.g. you might blacklist “party” when collecting a corpus on the environment using seeds which include “green”). The whitelist and blacklist are only necessary if you are getting irrelevant documents.
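The effect of the two keyword lists can be illustrated with a minimal Python sketch. It is not how Sketch Engine implements the filter: the function names and the thresholds min_white and max_black are hypothetical and chosen only for this example.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"\w+", text.lower())


def keep_document(text, whitelist=(), blacklist=(), min_white=1, max_black=0):
    """Return True if the text passes both keyword lists.

    min_white and max_black are hypothetical thresholds for this sketch.
    """
    tokens = tokenize(text)
    white_hits = sum(tokens.count(w.lower()) for w in whitelist)
    black_hits = sum(tokens.count(b.lower()) for b in blacklist)
    # A non-empty whitelist makes at least min_white hits compulsory.
    if whitelist and white_hits < min_white:
        return False
    # Too many blacklisted words mark the document as off-topic.
    return black_hits <= max_black


docs = [
    "Green energy reduces emissions and protects the climate.",
    "The Green party won three seats in the election.",
    "She painted the door green.",
]
kept = [
    d for d in docs
    if keep_document(d, whitelist=["climate", "emissions"], blacklist=["party"])
]
```

Here only the first document survives: the second and third contain the ambiguous seed “green” but none of the compulsory whitelist words, and the second would also be rejected by the blacklist alone.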

How to create a 10-million-word corpus?

You can run WebBootCaT many times to build a bigger corpus. Aim for 20-60 seeds if that is possible in your domain. You can repeat the process with the same seeds multiple times (there is only a very small probability that the same seed tuples will be chosen). You can also split your seeds into sets of 10 seeds and run the tool with each set. Note that you can use multiword expressions such as “kick the bucket” by enclosing them in quotes, and also proper names of different kinds.
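To see why repeated runs rarely pick the same seed tuples, and what splitting seeds into sets of 10 looks like, here is a small Python sketch. It only imitates the idea of drawing random tuples from a seed list; the tuple size of 3, the number of tuples and all function names are assumptions made for illustration, not documented WebBootCaT settings.

```python
import math
import random
from itertools import combinations


def seed_tuples(seeds, tuple_size=3, n_tuples=10, rng=None):
    """Draw distinct random tuples of seeds to use as search queries."""
    rng = rng or random.Random()
    all_tuples = list(combinations(sorted(set(seeds)), tuple_size))
    return rng.sample(all_tuples, min(n_tuples, len(all_tuples)))


def split_seeds(seeds, set_size=10):
    """Split a seed list into consecutive sets of at most set_size seeds."""
    return [seeds[i:i + set_size] for i in range(0, len(seeds), set_size)]


def as_query(seed_tuple):
    """Quote multiword seeds so they are searched as phrases."""
    return " ".join(f'"{s}"' if " " in s else s for s in seed_tuple)


# Number of possible 3-seed tuples for the smallest and largest
# recommended seed list sizes.
possible_with_20 = math.comb(20, 3)  # 1140
possible_with_60 = math.comb(60, 3)  # 34220
```

With 20 seeds and tuples of three there are already 1140 possible tuples, so a second run that draws only a handful of them is unlikely to repeat many of the earlier queries.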

How to limit my corpus to British English or European Portuguese only?

Limit the search to UK domains or the domains of Portugal only: type .uk (or .pt) into the site list in the advanced options.
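The effect of such a site list can be sketched in Python as a simple check on the host name of each URL. How Sketch Engine applies the site list internally is not described here, so the function below (a hypothetical in_site_list) only illustrates the outcome. Keep in mind that a country domain is a proxy for the language variety, not a guarantee.

```python
from urllib.parse import urlsplit


def in_site_list(url, site_list):
    """Return True if the URL's host equals or ends with a listed domain."""
    host = (urlsplit(url).hostname or "").lower()
    for site in site_list:
        domain = site.lstrip(".").lower()
        if host == domain or host.endswith("." + domain):
            return True
    return False


urls = [
    "https://www.bbc.co.uk/news",
    "https://www.publico.pt/",
    "https://example.com/uk/page",
]
uk_only = [u for u in urls if in_site_list(u, [".uk"])]
pt_only = [u for u in urls if in_site_list(u, [".pt"])]
```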

How do I get new seed words when I want to repeat the process?

To repeat the process with new seed words, use the keyword extraction from the current corpus.

  • click Home
  • locate your corpus and click the wrench button (manage your corpus)
  • in the left menu, click Keywords and terms; the process will start automatically
  • tick the keywords you want to use as new seed words
  • click Use WebBootCaT with selected words
  • name this part of your corpus and then proceed as normal

You can repeat the process as many times as you like. You can see how much data you have at each stage by checking the corpus page.

Why are some paragraphs missing?

The web building tool uses the jusText tool to remove boilerplate, i.e. unwanted content such as page navigation, headers, footers and very short paragraphs. Distinguishing low-quality text from good-quality text is very difficult to do programmatically. This is why, on very rare occasions, some good content may be removed by mistake.

A tip for downloading pages with little text on them: set Min file size and Min cleaned file size to zero in the advanced options. The tool is still likely to ignore short isolated paragraphs, which are common in some discussions.
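The following Python sketch is a toy illustration of why short paragraphs can go missing even when the minimum sizes are set to zero. It is not the jusText algorithm, which takes more signals into account than length alone; the function name and both threshold values are invented for this example.

```python
def clean_page(paragraphs, min_paragraph_chars=70, min_cleaned_size=0):
    """Toy cleaner: drop short paragraphs, then apply a minimum page size.

    min_paragraph_chars and min_cleaned_size are hypothetical values.
    """
    # Short paragraphs are treated as boilerplate regardless of content.
    kept = [p for p in paragraphs if len(p) >= min_paragraph_chars]
    # The whole page is discarded if too little text remains.
    cleaned_size = sum(len(p.encode("utf-8")) for p in kept)
    if cleaned_size < min_cleaned_size:
        return []
    return kept


forum_page = [
    "Home | Forum | Login",
    "Thanks, that worked!",
    "I had the same problem after the update. Reinstalling the driver "
    "and rebooting twice finally fixed it for me.",
]
kept = clean_page(forum_page)
```

In this toy version the navigation line is removed as intended, but the short reply “Thanks, that worked!” is lost as well, although the minimum cleaned size is zero.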

Unsupported languages

A corpus can be created from the web even if the language is not supported by Sketch Engine. Select “–other (UTF-8)–” from the language dropdown if your language is not listed. The following limitations apply:

  • only the universal tokenizer can be applied (or use your own tokenizer prior to uploading data),
  • no automated taggers can be applied (or tag the data with your own tagger prior to uploading),
  • automatic encoding detection might be limited; uploading files in UTF-8 is recommended,
  • the search engine setting will not constrain the search to any language when using WebBootCaT.

The Word sketch feature and related functions depend on a sketch grammar: you can define your own or select the universal generic sketch grammar.
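If your files are in another encoding, they can be converted to UTF-8 before uploading. The Python sketch below assumes that you already know the source encoding; the function names and the file names in the usage note are hypothetical.

```python
from pathlib import Path


def to_utf8(data, source_encoding):
    """Decode bytes from a known encoding and re-encode them as UTF-8."""
    return data.decode(source_encoding).encode("utf-8")


def convert_file(src, dst, source_encoding):
    """Write a UTF-8 copy of the file src to dst."""
    Path(dst).write_bytes(to_utf8(Path(src).read_bytes(), source_encoding))


latin2_bytes = "Pomikálek".encode("iso-8859-2")
utf8_bytes = to_utf8(latin2_bytes, "iso-8859-2")
```

For example, convert_file("text_latin2.txt", "text_utf8.txt", "iso-8859-2") would write a UTF-8 copy that can then be uploaded.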

More on unsupported languages»

Duplicated content

When creating a corpus or adding new texts to an existing corpus using WebBootCaT, a simple strategy is applied to avoid duplicated content:

  • Sketch Engine will not download the same URL twice into the same corpus
  • if exactly the same content (an exact copy of the same document) is found at a different URL, it will not be downloaded again
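The two rules above can be sketched in Python as follows. This is only a minimal illustration of the strategy, assuming that exact copies are detected by comparing a hash of the content; the class name and the use of SHA-256 are choices made for this example, not a description of the actual implementation.

```python
import hashlib


class SimpleDeduplicator:
    """Track URLs and content hashes already added to one corpus."""

    def __init__(self):
        self.seen_urls = set()
        self.seen_hashes = set()

    def should_add(self, url, content):
        """Return True only for a new URL that carries new content."""
        # Rule 1: the same URL is never added twice.
        if url in self.seen_urls:
            return False
        self.seen_urls.add(url)
        # Rule 2: an exact copy found at a different URL is skipped.
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest in self.seen_hashes:
            return False
        self.seen_hashes.add(digest)
        return True


dedup = SimpleDeduplicator()
first = dedup.should_add("http://a.example/1", "Some article text.")
same_url = dedup.should_add("http://a.example/1", "Updated text.")
exact_copy = dedup.should_add("http://b.example/copy", "Some article text.")
near_copy = dedup.should_add("http://b.example/2", "Some article text!")
```

Note that a document differing by a single character still counts as new under this simple strategy, so near-duplicates are not caught.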

A more sophisticated deduplication option, which uses the onion deduplication tool, becomes available if the user decides to compile the corpus manually. It has to be selected manually.

I cannot download a specific website.

The internet is a decentralised and constantly changing place; therefore, we cannot guarantee that a particular website can be downloaded. You can try another tool for downloading entire websites, e.g. HTTrack, cURL or Wget. These alternative tools can also help you obtain texts from password-protected web pages.

A tip for downloading text-sparse pages (e.g. an internet forum): set Min file size and Min cleaned file size to zero in the advanced options. A boilerplate removal tool is used to extract the text from a web page, and it is likely to ignore short isolated paragraphs, which are common in some discussions.
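As one possible way of using Wget for this, the Python sketch below builds a command line for a recursive download. The function name, the output directory and the default depth and wait values are assumptions for illustration; adjust them to the website and respect its terms of use.

```python
import shlex


def build_wget_command(url, out_dir="site_dump", depth=3, wait_seconds=1):
    """Build a Wget command for a polite recursive download of one site."""
    return [
        "wget",
        "--recursive",                    # follow links within the site
        f"--level={depth}",               # limit the recursion depth
        "--no-parent",                    # stay below the starting directory
        f"--wait={wait_seconds}",         # pause between requests
        f"--directory-prefix={out_dir}",  # save files under out_dir
        url,
    ]


cmd = build_wget_command("https://example.com/docs/")
print(shlex.join(cmd))
```

If Wget is installed, the list can be passed to subprocess.run(cmd, check=True), and the downloaded files can then be uploaded to the corpus.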

Bibliography

For more information on WebBootCaT, please see WebBootCaT: a web tool for instant corpora (2006).

WebBootCaT: instant domain-specific corpora to support human translators

  • Marco Baroni, Adam Kilgarriff, Jan Pomikálek and Pavel Rychlý (2006)
  • In Proceedings of EAMT. 11th Annual Conference of the European Association for Machine Translation. Oslo, Norway, pp. 247–252