The Internet contains an enormous number of formats, standards, protocols and settings. Although we try hard to accommodate all of them, we cannot guarantee a particular website or page will be downloaded.
Some websites explicitly allow only certain crawlers (e.g. only Google's) to download their content. We must respect those settings to avoid being placed on a denylist.
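One common way for a website to express such a restriction is a robots.txt file. The sketch below is only an illustration of that mechanism, not a description of how Sketch Engine checks permissions. It uses Python's standard urllib.robotparser module on a made-up robots.txt; the crawler name ExampleCorpusBot, the helper is_allowed and the example.com URL are assumptions chosen for the example.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt: only Googlebot may crawl, everyone else is refused.
SAMPLE_ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines, so no network is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "ExampleCorpusBot" is a hypothetical crawler name used for illustration.
print(is_allowed(SAMPLE_ROBOTS_TXT, "Googlebot", "https://example.com/page"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "ExampleCorpusBot", "https://example.com/page"))
```

With this sample file, the first call reports that Googlebot is allowed and the second reports that the hypothetical crawler is not, which is the situation in which a page cannot be added to a corpus.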
With the default settings, a page will not be downloaded if it does not contain a linguistically valuable body of text, for example if it has very little text, or a lot of text divided into many short unrelated sections, as on this page.
Sketch Engine cannot download password-protected web pages.
The exact reason why a page was not included in the corpus can be found in the log file (MANAGE CORPUS – LOGS; the log name contains bootcat_and_compile.log).
Forums, discussions and other text-sparse pages
Set Min file size and Min cleaned file size to zero in the advanced options. Very short isolated paragraphs may still be ignored because they might be incorrectly identified as navigation menus or similar linguistically unsuitable content. If necessary, use the Save as option in your browser to save the page, then upload the saved file to Sketch Engine manually.
Alternative download tools
Tools such as HTTrack, cURL or Wget might be able to download the problematic pages. These tools can also help with password-protected web pages. Bear in mind possible legal implications when using these tools to download internet content.
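If none of these tools is at hand, a single page can also be saved with a few lines of Python and then uploaded manually. This is a minimal sketch using only the standard library, not one of the tools named above; the function name save_page and the user-agent string example-page-saver/0.1 are assumptions made for the example. It does not handle logins, JavaScript-rendered content or linked resources.

```python
from pathlib import Path
from urllib.request import Request, urlopen


def save_page(url: str, dest: Path, user_agent: str = "example-page-saver/0.1") -> int:
    """Download url and write the raw response body to dest.

    Returns the number of bytes written. The default user_agent is a
    hypothetical value; replace it with one that identifies your own tool.
    """
    request = Request(url, headers={"User-Agent": user_agent})
    # The timeout keeps the call from hanging on an unresponsive server.
    with urlopen(request, timeout=30) as response:
        data = response.read()
    dest.write_bytes(data)
    return len(data)


# Example call (requires network access), left as a comment on purpose:
# save_page("https://example.com/", Path("example.html"))
```

The saved file can then be uploaded to Sketch Engine in the same way as a page saved from the browser. The same legal considerations apply as for the tools above.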