TenTen Corpus Family
The TenTen Corpus Family (TenTen corpora) is a family of text corpora created from the Web. All TenTen corpora are prepared according to the same criteria and can therefore be regarded as comparable corpora. They are built using technology designed to collect only linguistically valuable web content.
The name TenTen refers to the target corpus size of more than 10 billion (10^10) words per language. TenTen corpora are currently available in more than 40 languages, including English, Spanish, Japanese, Chinese, Greek, Estonian, Arabic, and Russian.