TenTen Corpus Family
The TenTen Corpus Family (TenTen corpora) is a family of text corpora created from the Web. All TenTen corpora are prepared according to the same criteria and can be regarded as comparable corpora. The corpora are built using technology specialized in collecting only linguistically valuable web content.
The name TenTen refers to the target corpus size of 10+ billion words per language. TenTen corpora are currently available in 40+ languages, including English, Spanish, Japanese, Chinese, Greek, Estonian, Arabic and Russian.
TenTen corpora available in Sketch Engine
The complete list of TenTen corpora available in Sketch Engine is given below.
|arTenTen (Arabic web corpus)||beTenTen (Belarusian web corpus)||bgTenTen (Bulgarian web corpus)|
|caTenTen (Catalan web corpus)||cebTenTen (Cebuano web corpus)||csTenTen (Czech web corpus)|
|daTenTen (Danish web corpus)||deTenTen (German web corpus)||elTenTen (Greek web corpus)|
|enTenTen (English web corpus)||esTenTen (Spanish web corpus with European/American Spanish subcorpora)||etTenTen (Estonian web corpus)|
|fiTenTen (Finnish web corpus)||frTenTen (French web corpus)||heTenTen (Hebrew web corpus)|
|hiTenTen (Hindi web corpus)||huTenTen (Hungarian web corpus)||itTenTen (Italian web corpus)|
|jaTenTen (Japanese web corpus)||kmTenTen (Khmer web corpus)||koTenTen (Korean web corpus)|
|loTenTen (Lao & Isan web corpus)||ltTenTen (Lithuanian web corpus)||lvTenTen (Latvian web corpus)|
|miTenTen (Māori web corpus)||nlTenTen (Dutch web corpus)||noTenTen (Norwegian web corpus)|
|plTenTen (Polish web corpus)||ptTenTen (Portuguese web corpus)||roTenTen (Romanian web corpus)|
|ruTenTen (Russian web corpus)||skTenTen (Slovak web corpus)||slTenTen (Slovenian web corpus)|
|svTenTen (Swedish web corpus)||teTenTen (Telugu web corpus)||thTenTen (Thai web corpus)|
|tlTenTen (Tagalog web corpus)||trTenTen (Turkish web corpus)||ukTenTen (Ukrainian web corpus)|
|urTenTen (Urdu web corpus)||zhTenTen (Chinese Simplified characters web corpus)|
Description of preparing TenTen corpora
- Corpora are crawled from the Internet with the Spiderling tool, a web spider designed for linguistic purposes.
- The web download is followed by text cleaning, in which texts are processed by jusText, a heuristic-based boilerplate removal tool that removes irrelevant (non-text or poor-text) content such as navigation links, advertisements, headers and footers.
- The next step is a tokenization process.
- Afterwards, onion performs deduplication at the paragraph level.
- Finally, corpus texts are lemmatized and part-of-speech tagged for languages for which tagger and lemmatizer tools are available.
Detailed information about these tools is available on the corpus.tools website, and the building of TenTen corpora is described in the bibliography (below).
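The paragraph-level deduplication step can be illustrated with a minimal sketch. It is not onion's actual implementation: it assumes naive whitespace tokenization, and the function names, the n-gram length `n=5` and the `threshold=0.5` are illustrative choices made here, not values taken from the tool. A paragraph is dropped when more than `threshold` of its word n-grams have already been seen in earlier paragraphs.

```python
from typing import Iterable, List, Set, Tuple

Shingle = Tuple[str, ...]


def shingles(tokens: List[str], n: int) -> Set[Shingle]:
    """Return the set of word n-grams found in a token list."""
    if len(tokens) < n:
        # A paragraph shorter than n tokens is treated as one shingle.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def deduplicate_paragraphs(
    paragraphs: Iterable[str], n: int = 5, threshold: float = 0.5
) -> List[str]:
    """Keep a paragraph unless more than `threshold` of its n-grams were seen before."""
    seen: Set[Shingle] = set()
    kept: List[str] = []
    for paragraph in paragraphs:
        tokens = paragraph.lower().split()  # assumption: whitespace tokenization
        grams = shingles(tokens, n)
        if not grams:
            continue  # skip empty paragraphs
        overlap = len(grams & seen) / len(grams)
        if overlap <= threshold:
            kept.append(paragraph)
        seen |= grams  # remember n-grams of every paragraph, kept or not
    return kept


if __name__ == "__main__":
    sample = [
        "The quick brown fox jumps over the lazy dog today",
        "Something entirely different appears in this second paragraph here",
        "The quick brown fox jumps over the lazy dog tomorrow",
    ]
    for paragraph in deduplicate_paragraphs(sample):
        print(paragraph)
```

In this sketch the third sample paragraph shares five of its six 5-grams with the first one, so it is treated as a near-duplicate and removed.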
The following corpus metadata (structural attributes in corpus linguistics) are shared by all TenTen corpora:
- Top-level domain – e.g. “com”
- Website – e.g. “wikipedia.org”
- Web domain – e.g. “en.wikipedia.org”
- URL – e.g. “https://en.wikipedia.org/wiki/Wikipedia” (URL of the source document)
- wordcount – e.g. “152” (the exact number of words in the document)
- length – e.g. “0–1k” (length of the document in thousands of words)
- heading – “1” marks headline texts, “0” marks other texts
Attributes specific to particular corpora can be found on the corpus information page.
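As a rough illustration of how the URL-based attributes relate to each other, the sketch below derives them from a document URL using only the Python standard library. The function name and dictionary keys are hypothetical, and taking the last two host labels as the website is a simplifying assumption that fails for suffixes such as "co.uk"; it is not a description of how the TenTen pipeline computes these values.

```python
from urllib.parse import urlsplit


def document_attributes(url: str, text: str) -> dict:
    """Derive URL-based attributes and a word count for one document."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".") if host else []
    return {
        "url": url,
        "web_domain": host,
        # Assumption: the website is the last two labels of the host name.
        # A robust version would consult the Public Suffix List instead.
        "website": ".".join(labels[-2:]),
        "top_level_domain": labels[-1] if labels else "",
        # Assumption: words are whitespace-separated tokens.
        "wordcount": len(text.split()),
    }


if __name__ == "__main__":
    attrs = document_attributes(
        "https://en.wikipedia.org/wiki/Wikipedia",
        "Wikipedia is a free online encyclopedia",
    )
    for name, value in attrs.items():
        print(f"{name}: {value}")
```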
Tools to work with TenTen Corpora
The complete set of Sketch Engine tools is available for working with the billion-word TenTen corpora. These tools generate:
- word sketch – collocations categorized by grammatical relations
- thesaurus – synonyms and similar words for every word
- keywords – terminology extraction of one-word and multi-word units
- word lists – lists of nouns, verbs, adjectives etc. organized by frequency
- n-grams – frequency list of multi-word units
- concordance – examples in context
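To make two of the outputs listed above concrete, here is a minimal sketch of an n-gram frequency list and a keyword-in-context concordance over a tokenized text. It only illustrates the concepts on a toy token list; the function names are hypothetical and it does not reflect how Sketch Engine computes these results on indexed corpora.

```python
from collections import Counter
from typing import List, Tuple


def ngram_frequencies(tokens: List[str], n: int) -> Counter:
    """Count every contiguous sequence of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def concordance(
    tokens: List[str], keyword: str, window: int = 3
) -> List[Tuple[str, str, str]]:
    """Return (left context, hit, right context) for each occurrence of keyword."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, token, right))
    return lines


if __name__ == "__main__":
    tokens = "the cat sat on the mat and the cat slept".split()
    # Most frequent bigrams first.
    for gram, count in ngram_frequencies(tokens, 2).most_common(3):
        print(" ".join(gram), count)
    # Keyword in context, two tokens on each side.
    for left, hit, right in concordance(tokens, "cat", window=2):
        print(f"{left:>10} [{hit}] {right}")
```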
Tools for building new TenTen corpora are under continuous development. More information about these tools is available at http://corpus.tools/
Bibliography
Jakubíček, M., Kilgarriff, A., Kovář, V., Rychlý, P., & Suchomel, V. (2013, July). The TenTen corpus family. In 7th International Corpus Linguistics Conference CL (pp. 125-127).
Suchomel, V., & Pomikálek, J. (2012). Efficient web crawling for large text corpora. In Proceedings of the seventh Web as Corpus Workshop (WAC7) (pp. 39-43).