Internet-ZH: Corpus of the Chinese Web 2005

The Chinese Web 2005 (Internet-ZH) is a Chinese corpus made up of texts collected from the Internet in 2005. The corpus was created by Dr. Serge Sharoff, University of Leeds, UK.

Part-of-speech tagset and lemmatization

The Chinese Web 2005 corpus is tokenized and part-of-speech tagged using the NEUCSP tools from Northeastern University, China. The tagger labels each token with its part of speech and grammatical category; the labels are listed in the part-of-speech tagset summary.

Tools to work with the Chinese corpus from the web

A complete set of Sketch Engine tools is available for working with the Chinese Web 2005 (Internet-ZH) corpus. The tools generate:

  • keywords – terminology extraction of one-word units
  • word lists – lists of Chinese nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
  • text type analysis – statistics of metadata in the corpus
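To make the outputs listed above concrete, here is a minimal Python sketch of a frequency word list, an n-gram count and a keyword-in-context concordance. It is not how Sketch Engine computes them: the function names (`word_list`, `ngrams`, `concordance`) and the toy sentence are hypothetical, and the input is assumed to be already segmented into tokens, as the corpus is.

```python
from collections import Counter


def word_list(tokens):
    """Return (token, count) pairs sorted by descending frequency."""
    return Counter(tokens).most_common()


def ngrams(tokens, n):
    """Count every contiguous sequence of n tokens."""
    # zip over n shifted views of the token list yields the n-grams as tuples
    return Counter(zip(*(tokens[i:] for i in range(n))))


def concordance(tokens, query, width=2):
    """Return (left context, hit, right context) for each match of query."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == query:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append((left, tok, right))
    return lines


# Hypothetical pre-segmented sample; a real corpus supplies tokens from its tagger.
tokens = ["我", "喜欢", "中文", "语料库", "，", "他", "也", "喜欢", "中文", "。"]

for token, count in word_list(tokens)[:3]:
    print(token, count)

print(ngrams(tokens, 2).most_common(1))

for left, hit, right in concordance(tokens, "中文", width=1):
    print(" ".join(left), f"[{hit}]", " ".join(right))
```

The sketch works on whole tokens rather than characters, because Chinese text has no spaces and the segmentation step decides what counts as a word.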

Search the English corpus enTenTen

Sketch Engine offers a range of tools to work with this English corpus.

English Web 2020 (enTenTen20)

version ententen20_tt31_1 (April 2022)

  • 36.5 billion words
  • TreeTagger pipeline version 3.1
  • further cleaning and spam removal
  • genre annotation and topic classification

version ententen20_tt31 (April 2021)

  • 38 billion words (downloaded by SpiderLing in Nov & Dec 2019, Nov & Dec 2020 and Jan 2021)
  • TreeTagger pipeline version 3.1
  • samples from the biggest web domains were manually checked and content with poor linguistic quality was removed.

English Web 2018 (enTenTen18)

version enTenTen18_tt31 (February 2021)

  • 21.9 billion words (Oct & Nov 2018; Jan, Nov & Dec 2017; Nov & Dec 2016; mainly from 2018)
  • TreeTagger pipeline version 3.1
  • manual checking of the biggest web domains (which account for 70% of all texts); content with poor linguistic quality was removed

English Web 2015 (enTenTen15)

  • initial size 28 billion words

version 2 (spring 2017)

  • 15 billion words
  • TreeTagger pipeline version 2

version enTenTen15_tt31 (March 2020)

  • 13 billion words
  • TreeTagger pipeline version 3.1
  • topic classification (according to dmoz.org)
  • in-depth analysis of spam and its removal, including documents that are too short

English Web 2013 (enTenTen13)

version ententen13_tt2 (2014)

  • 19 billion words
  • TreeTagger pipeline version 2

version ententen13_tt2_1 (fall 2016)

  • new version of word sketch grammar
  • dynamic attribute doc.website instead of doc.t2ld

English Web 2012 (enTenTen12)

version ententen12_sample40M (14 June 2012)

  • sample of corpus – 3.7 billion words
  • crawled by SpiderLing in May 2012
  • encoded in UTF-8

version ententen12_1 (2012)

  • full corpus – 11 billion words

English Web 2008 (enTenTen08)

version 1 (15 November 2010)

  • initial version – 3.3 billion tokens
  • crawled by Heritrix in 2008
  • encoded in Latin1

TenTen corpora

Suchomel, V. (2020). Better Web Corpora For Corpus Linguistics And NLP. Doctoral thesis, Masaryk University, Faculty of Informatics, Brno. Supervised by Pavel Rychlý. Available from: https://is.muni.cz/th/u4rmz/.

Jakubíček, M., Kilgarriff, A., Kovář, V., Rychlý, P., & Suchomel, V. (2013, July). The TenTen corpus family. In 7th International Corpus Linguistics Conference CL (pp. 125-127).

Suchomel, V., & Pomikálek, J. (2012). Efficient web crawling for large text corpora. In Proceedings of the seventh Web as Corpus Workshop (WAC7) (pp. 39-43).

Genre annotation

Suchomel, V. (2021). Genre Annotation of Web Corpora: Scheme and Issues. In K. Arai, S. Kapoor, & R. Bhatia, Proceedings of the Future Technologies Conference (FTC) 2020, Volume 1 (pp. 738-754). Vancouver, Canada: Springer Nature Switzerland AG. ISBN 978-3-030-63127-7. doi:10.1007/978-3-030-63128-4_55.
