itSkELL: Italian corpus for SkELL

The Italian corpus for SkELL (itSkELL) is an Italian corpus made up of texts collected from the Internet. The texts come from the 2016 itTenTen corpus and were selected by Egon W. Stemle of Eurac Research. The corpus was built specifically to provide the best example sentences.


SkELL stands for Sketch Engine for Language Learning. It is a freely available web interface suitable for learning Italian.

Good sentence examples

The corpus consists solely of sentences, sorted by text quality; adjacent sentences need not be related to each other. The quality is computed by the GDEX system, which assigns a score to each sentence. The score is based mainly on sentence length (a minimum and a maximum length) and on the frequency of the words that occur in the sentence. Sentences are sorted so that those with the highest scores appear first in concordance results.
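
The scoring idea described above can be illustrated with a minimal sketch. This is not the actual GDEX implementation: the function names, the length limits, the frequency threshold and the equal weighting of the two components are all assumptions made for illustration only.

```python
from collections import Counter


def gdex_like_score(tokens, freq, min_len=5, max_len=20, common_threshold=2):
    """Return a score in [0, 1]; higher means a better example sentence.

    All parameters are hypothetical defaults, not GDEX settings.
    """
    if not tokens:
        return 0.0
    # Length component: 1.0 inside the preferred range, 0.0 outside it.
    length_ok = 1.0 if min_len <= len(tokens) <= max_len else 0.0
    # Frequency component: share of tokens that are common in the corpus.
    common = sum(1 for t in tokens if freq[t.lower()] >= common_threshold)
    return 0.5 * length_ok + 0.5 * common / len(tokens)


def rank_sentences(sentences):
    """Score every sentence and return (score, sentence) pairs, best first."""
    tokenized = [s.split() for s in sentences]
    # Word frequencies are counted over the whole (toy) corpus.
    freq = Counter(t.lower() for toks in tokenized for t in toks)
    scored = [(gdex_like_score(toks, freq), s)
              for toks, s in zip(tokenized, sentences)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    corpus = [
        "Il gatto dorme sul divano di casa",
        "Il gatto mangia",
        "Il cane dorme sul tappeto di casa",
    ]
    for score, sentence in rank_sentences(corpus):
        print(f"{score:.3f}  {sentence}")
```

In this toy setting, a sentence that is too short or that contains mostly rare words sinks to the bottom of the list, which mirrors how the highest-scoring sentences are shown first in a concordance.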

Part-of-speech tagset

The itSkELL corpus is PoS-tagged with the TreeTagger tool using Marco Baroni’s parameter file. The PoS tagset description is available here.
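
As a rough illustration of what PoS-tagged text looks like to a program, the sketch below reads tab-separated lines of the form word, tag, lemma, which is the layout TreeTagger produces when token and lemma output are enabled. The `Token` type, the `parse_tagged` helper and the sample lines (including the tag names) are hypothetical and are not taken from the itSkELL corpus.

```python
from typing import NamedTuple


class Token(NamedTuple):
    word: str
    tag: str
    lemma: str


def parse_tagged(text):
    """Parse 'word<TAB>tag<TAB>lemma' lines into Token tuples."""
    tokens = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and markup lines such as <s> or </s>.
        if not line or line.startswith("<"):
            continue
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # ignore lines that do not have exactly three fields
        tokens.append(Token(*parts))
    return tokens


if __name__ == "__main__":
    # Illustrative sample only; the tags are assumed, not real tagger output.
    sample = "<s>\nIl\tDET:def\til\ngatto\tNOUN\tgatto\ndorme\tVER:fin\tdormire\n</s>\n"
    for tok in parse_tagged(sample):
        print(tok.word, tok.tag, tok.lemma)
```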

Tools to work with the Italian corpus

A complete set of tools is available for working with this Italian corpus. They generate:

  • word sketch – Italian collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • word lists – lists of Italian nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
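
Two of the outputs listed above, n-gram frequency lists and concordance lines, can be sketched in a few lines. This is only a toy illustration of the concepts on a whitespace-tokenized sentence; it is not how Sketch Engine computes them, and the function names `ngram_frequencies` and `kwic` are hypothetical.

```python
from collections import Counter


def ngram_frequencies(tokens, n):
    """Count every run of n consecutive tokens."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(g) for g in grams)


def kwic(tokens, keyword, window=2):
    """Return (left context, keyword, right context) for each hit."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits


if __name__ == "__main__":
    tokens = "il gatto dorme e il gatto mangia".split()
    # Frequency list of two-word units, most frequent first.
    for gram, count in ngram_frequencies(tokens, 2).most_common():
        print(count, gram)
    # Concordance lines for one keyword.
    for left, kw, right in kwic(tokens, "gatto"):
        print(f"{left:>12} [{kw}] {right}")
```

With `n=1` the same counter yields a plain word frequency list.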


English Web 2015 (enTenTen15)

  • initial size 28 billion words

v2 (spring 2017)

  • 15 billion words
  • genre classification
  • in-depth analysis and removal of spam, including documents that are too short

English Web 2013 (enTenTen13)

  • 19 billion words

English Web 2012 (enTenTen12)

version 1 (14 June 2012)

  • sample of corpus – 3.7 billion words
  • crawled by SpiderLing in May 2012
  • encoded in UTF-8

version 2 (2012)

  • full corpus – 11 billion words

English Web 2008 (enTenTen08)

version 1 (15 November 2010)

  • initial version – 3.3 billion tokens
  • crawled by Heritrix in 2008
  • encoded in Latin1


Baisa, V., & Suchomel, V. (2014, December). SkELL – Web Interface for English Language Learning. In Eighth Workshop on Recent Advances in Slavonic Natural Language Processing, Brno, Tribun EU (pp. 63-70).

Search the Italian corpus

Sketch Engine offers a range of tools to work with this Italian corpus for the SkELL interface.
