itTenTen: Corpus of the Italian Web

The Italian Web corpus (itTenTen) is an Italian corpus made up of texts collected from the Internet. It is part of the TenTen corpus family, a set of web corpora built using the same method with a target size of 10+ billion words. Sketch Engine currently provides access to TenTen corpora in more than 30 languages.

Part-of-speech tagset

The corpus texts are cleaned and deduplicated, then part-of-speech tagged and lemmatized with the TreeTagger tool using Marco Baroni’s parameter file. The POS tagset description is available here.
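As a rough illustration of what tagged and lemmatized text looks like downstream, the sketch below parses TreeTagger-style output, assuming the common three-column, tab-separated layout (token, tag, lemma). The sample sentence, its tags, and the helper name `parse_tagged` are made up for illustration and are not taken from the corpus.

```python
from typing import NamedTuple


class Token(NamedTuple):
    word: str
    tag: str
    lemma: str


def parse_tagged(text: str) -> list[Token]:
    """Parse tab-separated 'word<TAB>tag<TAB>lemma' lines, skipping anything else."""
    tokens = []
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            # Blank lines or structural markup: not a token line.
            continue
        tokens.append(Token(*parts))
    return tokens


# Illustrative sample only (assumed tags, not real corpus output).
SAMPLE = (
    "Il\tART\til\n"
    "gatto\tNOUN\tgatto\n"
    "dorme\tVER:fin\tdormire\n"
    ".\tSENT\t.\n"
)

tokens = parse_tagged(SAMPLE)
for tok in tokens:
    print(f"{tok.word:<8}{tok.tag:<10}{tok.lemma}")
```

Keeping the three fields together in one record makes it easy to query by word form, tag, or lemma later on.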

Overview of Italian TenTen corpora

  • Italian Web 2016 (itTenTen16) – 4.9 billion words (end of May – mid-August)
  • Italian Web 2010 (itTenTen10) – 2.5 billion words

itTenTen corpus in detail

The chart shows the distribution of the parts of speech in the Italian Web corpus 2016.

Basic information

Tokens 5,864 million
Words 4,990 million
Sentences 228 million
Web pages 12 million

* The figures above are rounded to the nearest million.
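A few derived ratios can help put these counts in perspective. The short sketch below works them out from the figures above; it assumes the usual distinction in which tokens include punctuation and words do not, and because the inputs are rounded to millions (the web page count especially), the results are only approximate.

```python
# Counts from the table above, in millions.
TOKENS_M = 5864
WORDS_M = 4990
SENTENCES_M = 228
PAGES_M = 12

tokens_per_sentence = TOKENS_M / SENTENCES_M
word_share = WORDS_M / TOKENS_M
words_per_page = WORDS_M / PAGES_M

print(f"Average sentence length: {tokens_per_sentence:.1f} tokens")
print(f"Share of tokens that are words: {word_share:.1%}")
print(f"Average page length: about {words_per_page:.0f} words")
```

This gives roughly 25.7 tokens per sentence, about 85% of tokens counted as words, and on the order of 400 words per web page.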

Distribution of top-level domains

Tools to work with the Italian web corpora

A complete set of Sketch Engine tools is available to work with these Italian corpora to generate:

  • word sketch – Italian collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • keywords – terminology extraction of one-word and multi-word units
  • word lists – lists of Italian nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
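To make two of these outputs concrete, here is a minimal sketch of an n-gram frequency list and a keyword-in-context (KWIC) concordance over a toy sentence. It is not how Sketch Engine computes them; the function names `ngrams` and `kwic` and the whitespace tokenization are simplifying assumptions.

```python
from collections import Counter


def ngrams(tokens: list[str], n: int) -> Counter:
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def kwic(tokens: list[str], keyword: str, window: int = 3) -> list[tuple[str, str, str]]:
    """Return (left context, hit, right context) for each match of keyword."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits


# Toy example; a real corpus would be tokenized properly, not split on spaces.
toks = "il gatto dorme e il gatto mangia".split()

bigrams = ngrams(toks, 2)
for gram, count in bigrams.most_common(3):
    print(count, " ".join(gram))

hits = kwic(toks, "gatto", window=2)
for left, hit, right in hits:
    print(f"{left:>10} [{hit}] {right}")
```

On a corpus of billions of tokens these operations rely on precomputed indexes rather than a linear scan, but the output has the same shape: ranked multi-word units and aligned examples in context.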

itTenTen16 v. 1.1 (July 2017)

  • part-of-speech tagging
  • lemposes
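A lempos combines a lemma with a short part-of-speech suffix (for example, a noun lemma followed by "-n"), so that homographs with different parts of speech can be told apart. The sketch below shows the idea; the mapping from tags to suffix letters and the fallback suffix "x" are hypothetical and are not the corpus's actual configuration.

```python
# Hypothetical mapping from the first part of a POS tag to a one-letter suffix.
SUFFIX_BY_TAG_PREFIX = {
    "NOUN": "n",
    "VER": "v",
    "ADJ": "j",
    "ADV": "a",
}


def to_lempos(lemma: str, tag: str, default: str = "x") -> str:
    """Build 'lemma-suffix' from a lemma and a tag such as 'VER:fin'."""
    prefix = tag.split(":", 1)[0]  # drop any subtype after the colon
    suffix = SUFFIX_BY_TAG_PREFIX.get(prefix, default)
    return f"{lemma}-{suffix}"


print(to_lempos("gatto", "NOUN"))
print(to_lempos("dormire", "VER:fin"))
print(to_lempos("il", "ART"))
```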

itTenTen16 v. 1.0 (October 2016)

  • initial version – 4.9 billion words

itTenTen10 v. 1.0 (9 September 2010)

  • initial version – 2.6 billion words

TenTen corpora

Jakubíček, M., Kilgarriff, A., Kovář, V., Rychlý, P., & Suchomel, V. (2013, July). The TenTen corpus family. In 7th International Corpus Linguistics Conference CL (pp. 125-127).

Suchomel, V., & Pomikálek, J. (2012). Efficient web crawling for large text corpora. In Proceedings of the seventh Web as Corpus Workshop (WAC7) (pp. 39-43).

Other Italian corpora

Sketch Engine provides access to 400+ language corpora.
