These are excellent general-purpose corpora. Their main advantage is their large size, typically several billion words.
TenTen is a new generation of web corpora. They are created by crawling the web in a sophisticated way, and the downloaded texts undergo a multi-step process before they are included in the corpus. First, the texts are cleaned of non-text material, e.g. navigation menus, legal text or small print, and duplicate text is removed. The texts are then evaluated, and those which are too short or contain too much content unsuitable for use in a corpus are removed. TenTen stands for 10^10 (10 billion) words.
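
The processing steps described above (cleaning, filtering out short texts, removing duplicates) can be illustrated with a minimal Python sketch. This is not the actual TenTen pipeline: the function names `clean` and `build_corpus`, the `MIN_WORDS` threshold and the `BOILERPLATE` pattern are all assumptions made for illustration, and the sketch only catches exact duplicates, whereas a real web corpus would likely need near-duplicate detection as well.

```python
import hashlib
import re

# Hypothetical threshold and pattern, chosen only for illustration.
MIN_WORDS = 5
BOILERPLATE = re.compile(
    r"^(home|about|contact|login|terms of use|privacy policy|copyright\b.*)$",
    re.IGNORECASE,
)


def clean(text):
    """Drop empty lines and lines that look like navigation or legal text."""
    kept = []
    for line in text.splitlines():
        line = line.strip()
        if line and not BOILERPLATE.match(line):
            kept.append(line)
    return "\n".join(kept)


def build_corpus(downloaded_texts):
    """Clean each text, skip texts that are too short, drop exact duplicates."""
    seen = set()
    corpus = []
    for raw in downloaded_texts:
        text = clean(raw)
        if len(text.split()) < MIN_WORDS:
            continue  # too short to be useful in a corpus
        # Hash the lowercased text so repeated copies are detected cheaply.
        digest = hashlib.sha1(text.lower().encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of a text already kept
        seen.add(digest)
        corpus.append(text)
    return corpus
```

Given a list of downloaded page texts, `build_corpus` returns only the cleaned texts that pass both checks, in their original order.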