LANGTenTen: Corpus of the LANG Web

The LANG Web Corpus (LANGTenTen) is a LANG corpus made up of texts collected from the Internet and processed by our unique filtering and evaluation technology so that it includes only linguistically valuable content. The corpus belongs to the TenTen corpus family, a set of web corpora built using the same method with a target size of 10+ billion words. Sketch Engine currently provides access to TenTen corpora in more than 40 languages.

For detailed information about TenTen corpora, see Common TenTen corpora attributes.

Part-of-speech tagset

The LANG corpus was tagged by TreeTagger using the Penn Treebank tagset with Sketch Engine modifications.
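
To illustrate how a part-of-speech distribution can be derived from Penn Treebank tags, the Python sketch below groups a few standard tags into coarse classes and counts them. The coarse class names, the COARSE mapping, the pos_distribution function and the sample tagged tokens are assumptions made for this example; the Sketch Engine modifications to the tagset are not reflected here.

```python
from collections import Counter

# Standard Penn Treebank tags grouped into coarse classes (partial mapping;
# Sketch Engine's modified tags are not covered here).
COARSE = {
    "noun": {"NN", "NNS", "NNP", "NNPS"},
    "verb": {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"},
    "adjective": {"JJ", "JJR", "JJS"},
    "adverb": {"RB", "RBR", "RBS"},
}
TAG_TO_CLASS = {tag: cls for cls, tags in COARSE.items() for tag in tags}


def pos_distribution(tagged_tokens):
    """Return the share of each coarse class among (word, tag) pairs."""
    counts = Counter(TAG_TO_CLASS.get(tag, "other") for _, tag in tagged_tokens)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()} if total else {}


# Hypothetical tagger output for a short sentence.
sample = [("The", "DT"), ("quick", "JJ"), ("fox", "NN"), ("jumps", "VBZ"), ("high", "RB")]
print(pos_distribution(sample))
```

Tags outside the mapping fall into an "other" class, so the shares always sum to one.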

Tools to work with the LANG corpus

A complete set of tools is available to work with this LANG corpus to generate:

  • word sketch – LANG collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • keywords – terminology extraction of one-word and multi-word units
  • word lists – lists of LANG nouns, verbs, adjectives, etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
  • trends – diachronic analysis that automatically identifies neologisms and changes in use
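
As a rough illustration of what three of these tools compute, the Python sketch below builds a frequency word list, a bigram list and a keyword-in-context concordance from a toy token list. It is not Sketch Engine's implementation or API; the function names (word_list, ngrams, concordance) and the sample sentence are assumptions made for this example.

```python
from collections import Counter


def word_list(tokens):
    """Return (word, frequency) pairs, most frequent first."""
    return Counter(t.lower() for t in tokens).most_common()


def ngrams(tokens, n=2):
    """Count contiguous n-token sequences (case-folded)."""
    lowered = [t.lower() for t in tokens]
    grams = zip(*(lowered[i:] for i in range(n)))
    return Counter(grams)


def concordance(tokens, keyword, width=3):
    """Return keyword-in-context tuples: left context, hit, right context."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Toy input; a real corpus would be tokenized by the corpus pipeline.
tokens = "the cat sat on the mat and the cat slept".split()

print(word_list(tokens)[:2])
print(ngrams(tokens, 2).most_common(1))
for left, hit, right in concordance(tokens, "cat", width=2):
    print(f"{left:>12} [{hit}] {right}")
```

On a corpus of billions of words the same counts would need indexing rather than in-memory lists; the sketch only shows the shape of the results.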

LANG corpus in detail

The chart shows the distribution of the parts of speech in the LANG Web corpus 2013.

Basic information

Frequency*
Tokens 22 729
Words 19 686
Sentences 1 120
Web pages 37

* The figures above are rounded to the nearest million.

Distribution of top-level domains

Overview of LANG TenTen corpora

These web corpora were crawled and processed repeatedly during the last ten years:

  • LANG Web corpus 2015 (enTenTen15) – 15 billion words (advanced genre classification and sophisticated spam removal); the corpus has not been published yet.
  • LANG Web corpus 2013 (enTenTen13) – 19 billion words
  • LANG Web corpus 2012 (enTenTen12) – 11 billion words
  • LANG Web corpus 2008 (enTenTen08) – 2.7 billion words

English Web 2015 (enTenTen15)

  • initial size 28 billion words

v2 (spring 2017)

  • 15 billion words
  • genre classification
  • in-depth analysis and removal of spam, including documents that are too short

English Web 2013 (enTenTen13)

  • 19 billion words

English Web 2012 (enTenTen12)

version 1 (14 June 2012)

  • sample of corpus – 3.7 billion words
  • crawled by SpiderLing in May 2012
  • encoded in UTF-8

version 2 (2012)

  • full corpus – 11 billion words

English Web 2008 (enTenTen08)

version 1 (15 November 2010)

  • initial version – 3.3 billion tokens
  • crawled by Heritrix in 2008
  • encoded in Latin1

TenTen corpora

Jakubíček, M., Kilgarriff, A., Kovář, V., Rychlý, P., & Suchomel, V. (2013, July). The TenTen corpus family. In 7th International Corpus Linguistics Conference CL (pp. 125-127).

Suchomel, V., & Pomikálek, J. (2012). Efficient web crawling for large text corpora. In Proceedings of the seventh Web as Corpus Workshop (WAC7) (pp. 39-43).

Search the LANG corpus

Sketch Engine offers a range of tools to work with this LANG corpus from the web.

Other text corpora

Sketch Engine offers 500+ language corpora.

Use Sketch Engine in minutes

Generate collocations, frequency lists, examples in contexts, n-grams or extract terms. Use our Quick Start Guide to learn it in minutes.