TenTen Corpus Family

The TenTen Corpus Family (TenTen corpora) is a collection of text corpora created from the Web. All TenTen corpora are prepared according to the same criteria, which is intended to ensure consistent text quality and makes the corpora comparable with each other.

Each corpus in the family is a web corpus processed in the same way, with a target size of more than 10 billion words. Sketch Engine currently provides access to TenTen corpora in more than 30 languages.

TenTen corpora available in Sketch Engine

The list of TenTen corpora currently comprises text corpora in 30+ languages, built over the last ten years.

Search the TenTen corpora

Sketch Engine offers a range of tools to work with the TenTen corpora.

  • arTenTen (Arabic web corpus)
  • beTenTen (Belarusian web corpus)
  • bgTenTen (Bulgarian web corpus)
  • caTenTen (Catalan web corpus)
  • czTenTen (Czech web corpus)
  • daTenTen (Danish web corpus)
  • deTenTen (German web corpus)
  • elTenTen (Greek web corpus)
  • enTenTen (English web corpus)
  • esTenTen (Spanish web corpus)
  • esAmTenTen (American Spanish web corpus)
  • etTenTen (Estonian web corpus)
  • fiTenTen (Finnish web corpus)
  • frTenTen (French web corpus)
  • heTenTen (Hebrew web corpus)
  • hiTenTen (Hindi web corpus)
  • huTenTen (Hungarian web corpus)
  • itTenTen (Italian web corpus)
  • jpTenTen (Japanese web corpus)
  • koTenTen (Korean web corpus)
  • ltTenTen (Lithuanian web corpus)
  • lvTenTen (Latvian web corpus)
  • nlTenTen (Dutch web corpus)
  • noTenTen (Norwegian web corpus)
  • plTenTen (Polish web corpus)
  • ptTenTen (Portuguese web corpus)
  • roTenTen (Romanian web corpus)
  • ruTenTen (Russian web corpus)
  • skTenTen (Slovak web corpus)
  • slTenTen (Slovenian web corpus)
  • svTenTen (Swedish web corpus)
  • trTenTen (Turkish web corpus)
  • uaTenTen (Ukrainian web corpus)
  • zhTenTen (Chinese Simplified characters web corpus)

Description of preparing TenTen corpora

  1. Corpora are crawled from the Internet with the Spiderling tool, a web spider designed for linguistic purposes.
  2. After the download, the texts are cleaned with jusText, a heuristic-based boilerplate removal tool that strips irrelevant (non-text or poor-text) content such as navigation links, advertisements, headers, and footers.
  3. The next step is tokenization.
  4. Afterwards, onion performs deduplication at the paragraph level.
  5. Finally, the corpus texts are lemmatized and part-of-speech tagged for languages for which tagger and lemmatizer tools are available.
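
To make the tokenization and deduplication steps more concrete, here is a minimal sketch of paragraph-level deduplication based on shared token n-grams. It is not the onion implementation: the function names, the crude regex tokenizer, the n-gram length of 3, and the 0.5 threshold are all assumptions chosen for illustration.

```python
import re

def tokenize(text):
    """Very rough tokenizer (an assumption): lowercase words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def shingles(tokens, n):
    """Return the set of all token n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def deduplicate(paragraphs, n=3, threshold=0.5):
    """Drop a paragraph when more than `threshold` of its n-grams were already seen.

    Both `n` and `threshold` are illustrative values, not the settings
    used to build the TenTen corpora.
    """
    seen = set()
    kept = []
    for para in paragraphs:
        grams = shingles(tokenize(para), n)
        # Paragraphs shorter than n tokens have no n-grams and are always kept.
        if grams and len(grams & seen) / len(grams) > threshold:
            continue  # mostly repeated material, so skip it
        kept.append(para)
        seen |= grams  # only kept paragraphs contribute to the seen set
    return kept

paragraphs = [
    "Corpora are crawled from the Web with a spider.",
    "Boilerplate such as navigation links is removed.",
    "Corpora are crawled from the Web with a spider.",
]
unique_paragraphs = deduplicate(paragraphs)
print(unique_paragraphs)
```

Comparing n-grams rather than whole strings means that a paragraph can be dropped even when it differs from an earlier one by a few tokens, which is the general idea behind near-duplicate removal in web corpora.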

Detailed information about these tools is available on the corpus.tools website, and the building of TenTen corpora is described in the bibliography (below).

Corpus metadata

The following corpus metadata (structural attributes in corpus linguistics) are shared by all TenTen corpora.

Document structure

  • 1st level domain – e.g. “com”
  • 2nd level domain – e.g. “wikipedia.org”
  • Web domain – e.g. “en.wikipedia.org”
  • url – e.g. “https://en.wikipedia.org/wiki/Wikipedia” (URL of the source document)
  • wordcount – e.g. “152” (exact number of words in the document)
  • length – e.g. “0-1k” (length range of the document in thousands of words)
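
As a rough illustration of how the domain attributes relate to the URL, the sketch below derives them from a source address using the Python standard library. The function name and dictionary keys are hypothetical, the second-level domain is naively taken as the last two host labels (which would be wrong for suffixes such as "co.uk"), and the word count is a simple whitespace split rather than the tokenization used in the corpora.

```python
from urllib.parse import urlsplit

def document_attributes(url, text):
    """Derive document-level attributes from a URL and its text (illustrative only)."""
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    return {
        "first_level_domain": labels[-1],             # e.g. "org"
        "second_level_domain": ".".join(labels[-2:]),  # naive: last two labels
        "web_domain": host,                            # full host name
        "url": url,
        "wordcount": len(text.split()),                # whitespace tokens (assumption)
    }

attrs = document_attributes(
    "https://en.wikipedia.org/wiki/Wikipedia",
    "Wikipedia is a free online encyclopedia",
)
print(attrs)
```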

Paragraph structure

  • heading – number “1” means headline texts, “0” other texts

Attributes specific to particular corpora can be found on the corpus information page.

Tools to work with TenTen Corpora

The complete set of Sketch Engine tools is available for the billion-word TenTen corpora to generate:

  • word sketch – collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • word lists – lists of nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
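
To show what two of these outputs look like in principle, the following sketch computes an n-gram frequency list and a simple keyword-in-context concordance over a toy token list. It does not use or reproduce the Sketch Engine tools or their API; the function names and the sample sentence are made up for illustration.

```python
from collections import Counter

def ngram_frequencies(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def concordance(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for each hit of `keyword`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

tokens = "the web corpus is a large web corpus built from the web".split()
bigrams = ngram_frequencies(tokens, 2)
kwic = concordance(tokens, "corpus", width=2)

for left, key, right in kwic:
    print(f"{left:>12} [{key}] {right}")
```

On real corpora of billions of words, such counts are computed over indexed data rather than in-memory lists, but the resulting frequency lists and context lines have the same shape.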

Changelog

The tools for building new TenTen corpora are under continuous development. More information about these tools is available at http://corpus.tools/

Bibliography

Jakubíček, M., Kilgarriff, A., Kovář, V., Rychlý, P., & Suchomel, V. (2013, July). The TenTen corpus family. In 7th International Corpus Linguistics Conference CL (pp. 125-127).

Suchomel, V., & Pomikálek, J. (2012). Efficient web crawling for large text corpora. In Proceedings of the seventh Web as Corpus Workshop (WAC7) (pp. 39-43).
