LANGTenTen: Corpus of the LANG Web
The LANG Web Corpus (LANGTenTen) is a LANG corpus made up of texts collected from the Internet and processed by our unique filtering and evaluation technology so that it includes only linguistically valuable content. The corpus belongs to the TenTen corpus family, a set of web corpora built using the same method with a target size of 10+ billion words. Sketch Engine currently provides access to TenTen corpora in more than 40 languages.
For detailed information about TenTen corpora, see Common TenTen corpora attributes.
The LANG corpus was tagged by TreeTagger using the Penn Treebank tagset with Sketch Engine modifications.
Tools to work with the LANG corpus
A complete set of tools is available to work with this LANG corpus to generate:
- word sketch – LANG collocations categorized by grammatical relations
- thesaurus – synonyms and similar words for every word
- keywords – terminology extraction of one-word and multi-word units
- word lists – lists of LANG nouns, verbs, adjectives, etc. organized by frequency
- n-grams – frequency list of multi-word units
- concordance – examples in context
- trends – diachronic analysis that automatically identifies neologisms and changes in use
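To make two of the items above concrete, the following is a minimal sketch of what a word list and an n-gram frequency list are. It is an illustration only, not Sketch Engine's implementation: the function name `ngram_frequencies` and the toy sentence are assumptions made for this example, and tokenization is reduced to splitting on whitespace.

```python
from collections import Counter


def ngram_frequencies(tokens, n):
    """Count every contiguous run of n tokens (hypothetical helper)."""
    # zip over n shifted copies of the token list yields each n-gram as a tuple
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams)


# Toy input; a real corpus would be tokenized and tagged first.
tokens = "the cat sat on the mat and the cat slept".split()

# Word list: single tokens ordered by frequency.
word_list = Counter(tokens).most_common(2)

# N-grams: multi-word units (here bigrams) ordered by frequency.
bigrams = ngram_frequencies(tokens, 2).most_common(1)

print(word_list)  # [('the', 3), ('cat', 2)]
print(bigrams)    # [(('the', 'cat'), 2)]
```

In the corpus itself, the same kind of counts are computed over billions of tokens and can be restricted by part of speech, which is what allows word lists of nouns, verbs or adjectives.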
LANG corpus in detail
The chart shows the distribution of the parts of speech in the LANG Web corpus 2013.
Further information about texts in the corpus
* the figures above are rounded to the nearest million
Distribution of top-level domains
Overview of LANG TenTen corpora
These web corpora were crawled and processed repeatedly during the last ten years:
- LANG Web corpus 2015 (enTenTen15) – 15 billion words (advanced genre classification and sophisticated spam removal); the corpus has not been published yet
- LANG Web corpus 2013 (enTenTen13) – 19 billion words
- LANG Web corpus 2012 (enTenTen12) – 11 billion words
- LANG Web corpus 2008 (enTenTen08) – 2.7 billion words
English Web 2015 (enTenTen15)
- initial size 28 billion words
version 2 (spring 2017)
- 15 billion words
- genre classification
- in-depth analysis and removal of spam, including documents that are too short
English Web 2013 (enTenTen13)
- 19 billion words
English Web 2012 (enTenTen12)
version 1 (14 June 2012)
- sample of corpus – 3.7 billion words
- crawled by SpiderLing in May 2012
- encoded in UTF-8
version 2 (2012)
- full corpus – 11 billion words
English Web 2008 (enTenTen08)
version 1 (15 November 2010)
- initial version – 3.3 billion tokens
- crawled by Heritrix in 2008
- encoded in Latin1
Jakubíček, M., Kilgarriff, A., Kovář, V., Rychlý, P., & Suchomel, V. (2013, July). The TenTen corpus family. In 7th International Corpus Linguistics Conference CL (pp. 125-127).
Suchomel, V., & Pomikálek, J. (2012). Efficient web crawling for large text corpora. In Proceedings of the seventh Web as Corpus Workshop (WAC7) (pp. 39-43).