By spam we refer to text found on the internet which was not produced by a human, or which was produced only once but was automatically replicated in many other places on the web. Spam may include:
- texts whose abundance on the internet is highly disproportionate to their frequency outside the internet (pornographic and adult sites, sites selling products for slimming, muscle gain, hair growth and other health products). These sites are often duplicated automatically under various URLs, which increases their presence even further.
- machine-generated text, often not intended to communicate any meaningful information
- machine-translated texts
The presence of spam in our web corpora is partly reduced by deduplication. Even in the worst-case scenario, at most one copy of each page will be present in the corpora. However, the main method of avoiding spam in our corpora is the use of seed URLs.
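The effect of deduplication described above can be illustrated with a minimal sketch. It assumes exact-duplicate detection by hashing whitespace-normalised, lower-cased text; the function names and sample pages are hypothetical and this is not the deduplication tool actually used to build the corpora.

```python
import hashlib

def normalize(text):
    # Lower-case and collapse whitespace so trivially different copies match.
    return " ".join(text.lower().split())

def deduplicate(pages):
    """Keep only the first page seen for each distinct normalised text."""
    seen = set()
    unique = []
    for url, text in pages:
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append((url, text))
    return unique

# Hypothetical sample: the same advert replicated under two URLs.
pages = [
    ("http://a.example/1", "Lose weight fast!"),
    ("http://b.example/x", "lose  weight   fast!"),
    ("http://c.example/news", "Local council approves new park."),
]
unique_pages = deduplicate(pages)
```

However many times the advert is replicated, only one copy survives, which is why deduplication limits spam but cannot remove it entirely.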
The process of web crawling is not completely random. Before the crawling starts, a list of respectable, high-quality websites is compiled and the web crawler starts by downloading the content of these seed URLs. They can be media sites, blogs, professional sites and also other sites from which we downloaded good content in the past. If a link leading to another website is found, the web crawler will follow the link, but it will only continue doing so up to a previously defined level. Since most of the unwanted web content is in English, the level has to be set low when building an English web corpus. It can be higher for other major languages and higher still for less widely used languages, where the danger of reaching spam is much lower.
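The idea of following links from seed URLs only up to a defined level can be sketched as a breadth-first traversal with a depth limit. This is an illustration under simplifying assumptions, not the crawler used by the Sketch Engine team: the link graph is a hypothetical in-memory dictionary instead of real downloads, and the level is counted in link hops from a seed URL.

```python
from collections import deque

def crawl(seed_urls, get_links, max_depth):
    """Breadth-first crawl that does not follow links beyond max_depth."""
    visited = set(seed_urls)
    queue = deque((url, 0) for url in seed_urls)
    crawled = []
    while queue:
        url, depth = queue.popleft()
        crawled.append((url, depth))
        if depth >= max_depth:
            continue  # the page is kept, but its links are not followed
        for link in get_links(url):
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return crawled

# Hypothetical link graph standing in for the web.
LINKS = {
    "https://news.example/": ["https://blog.example/", "https://shop.example/"],
    "https://blog.example/": ["https://spam.example/"],
    "https://shop.example/": [],
    "https://spam.example/": [],
}

def get_links(url):
    return LINKS.get(url, [])

crawled = crawl(["https://news.example/"], get_links, max_depth=1)
crawled_urls = [url for url, _ in crawled]
```

With `max_depth=1` the spam page two hops away from the seed is never reached; raising the limit to 2 would include it, which is the trade-off described above for different languages.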
Seed URLs cannot be used when building user corpora with the built-in WebBootCaT tool. They are only used in web corpus building carried out by the Sketch Engine team. Users can, however, build a corpus by downloading websites one at a time by using WebBootCaT with the website option.
The process of web crawling to obtain a general language web corpus also includes various additional criteria.
A document is only kept in the corpus if the downloaded web page contains enough data after applying the above cleaning tools. If the document is too short, for example, one sentence only, it will not be included because a lone sentence out of context is rarely linguistically valuable. On the other hand, if the document is too long, for example, many thousands of words, this may indicate that the content is not a standard web page or that it is not of a linguistic nature at all. Such documents are not included either.
When building user corpora with the built-in WebBootCaT tool, these parameters can be set to different values or even disabled to include absolutely all text in the corpus.
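A length filter of this kind can be sketched as follows. The thresholds `MIN_WORDS` and `MAX_WORDS` are purely illustrative assumptions, not the values used for the Sketch Engine corpora or the names of WebBootCaT settings; passing `None` for a limit disables it.

```python
# Illustrative thresholds only; the real values are not stated in this text.
MIN_WORDS = 50
MAX_WORDS = 20_000

def keep_document(text, min_words=MIN_WORDS, max_words=MAX_WORDS):
    """Return True if the cleaned document has an acceptable length."""
    n_words = len(text.split())
    if min_words is not None and n_words < min_words:
        return False  # too short, e.g. a lone sentence
    if max_words is not None and n_words > max_words:
        return False  # suspiciously long, possibly not a normal web page
    return True

normal_page = "word " * 100
lone_sentence = "One lone sentence."
huge_page = "word " * 30_000

results = [keep_document(d) for d in (normal_page, lone_sentence, huge_page)]
# Disabling both limits keeps every document.
keep_all = keep_document(lone_sentence, min_words=None, max_words=None)
```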
During web crawling, the language of the downloaded text is detected and only texts in the desired language are included. This means that an English corpus can contain pages published on German, Spanish, French, Japanese and other websites as long as they are in English.
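The point that language, not the website's country, decides what is kept can be shown with a deliberately naive sketch. It guesses the language by counting a handful of common function words; production language identification relies on trained models, and the word lists, function names and sample pages here are assumptions made for illustration only.

```python
import re

# Tiny, hand-picked function-word lists; real detectors use trained models.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "in", "is", "that", "it"},
    "de": {"der", "die", "und", "das", "ist", "nicht", "zu", "in"},
    "es": {"el", "la", "y", "de", "que", "en", "los", "es"},
}

def guess_language(text):
    """Return the language whose function words occur most often, or None."""
    words = re.findall(r"\w+", text.lower())
    scores = {lang: sum(w in sw for w in words) for lang, sw in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

def filter_by_language(pages, wanted):
    return [(url, text) for url, text in pages if guess_language(text) == wanted]

# Hypothetical pages: an English text on a German site is still English.
pages = [
    ("https://zeitung.example.de/en/weather",
     "The weather in Berlin is cold and it is likely to snow."),
    ("https://zeitung.example.de/wetter",
     "Das Wetter in Berlin ist kalt und es ist nicht sonnig."),
]
english_pages = filter_by_language(pages, "en")
```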
Where are the texts from?
Despite the use of seed URLs as the starting points for the web crawling, it is not easy to generalise and give a simple answer to this question. However, each document (a downloaded web page) in the corpus comes with metadata such as the source website as well as the exact URL from which the text was downloaded. The user can generate a list of all the websites or URLs together with the number of documents or tokens downloaded from each source. This can provide some insight into where the data come from.
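The kind of per-source summary described above can be sketched in a few lines. This is not the Sketch Engine interface: it assumes the documents are available as hypothetical `(url, text)` pairs and approximates tokens by whitespace-separated words.

```python
from collections import Counter
from urllib.parse import urlparse

def source_stats(documents):
    """Count documents and (whitespace-split) tokens per source website."""
    doc_counts = Counter()
    token_counts = Counter()
    for url, text in documents:
        site = urlparse(url).netloc
        doc_counts[site] += 1
        token_counts[site] += len(text.split())
    return doc_counts, token_counts

# Hypothetical documents with their source URLs.
documents = [
    ("https://news.example/politics/1", "Parliament votes on the new budget today"),
    ("https://news.example/sport/7", "Home team wins again"),
    ("https://blog.example/post", "Notes on corpus building"),
]
doc_counts, token_counts = source_stats(documents)

# List websites from the largest to the smallest contribution.
for site, n_tokens in token_counts.most_common():
    print(site, doc_counts[site], n_tokens)
```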
Similarly, it is possible to display this information for each concordance line or to narrow the search to certain websites only, so the user is always in control of where the results come from. Text types and subcorpora are the features designed for this purpose.
How to build your own web corpus
There is little point in building your own general-purpose web corpus because there are plenty of them in Sketch Engine already. The largest ones have a size of 40 billion words, and the Timestamped corpus, available in 18 languages, is even updated daily.
If you need to build a specialized corpus, use the built-in web corpus building tool with one or more of these options:
- build a corpus from a web search
- build a corpus from web links
- download a website.