Chinese Gigaword corpus search

Chinese Gigaword: Corpus of the Mainland and Traditional Chinese

The Chinese Gigaword Corpus is a corpus of Chinese journalism. It contains data from news agency archives and was prepared by the Linguistic Data Consortium (LDC), with source data covering the period 1990–2002. Chinese Gigaword comprises almost 600 million words belonging to two separate corpora:

Chinese GigaWord 2 Corpus: Mainland, simplified characters

source data is journalism from the Xinhua News Agency, Beijing, from 1991 to 2002
size more than 200 million words

Chinese GigaWord 2 Corpus: Taiwan, traditional characters

source data is journalism from the Central News Agency, Taiwan, from 1990 to 2002
size more than 380 million words

More information can be found at https://catalog.ldc.upenn.edu/LDC2003T09

Part-of-speech tagset

The Chinese Gigaword corpus is POS-tagged using a Chinese part-of-speech tagset.

Tools to work with the Chinese Gigaword corpus

A complete set of Sketch Engine tools is available for working with these Mainland and Traditional Chinese corpora. The tools generate:

word sketch – Chinese collocations categorized by grammatical relations
thesaurus – synonyms and similar words for every word
word lists – lists of Chinese nouns, verbs, adjectives etc. organized by frequency
n-grams – frequency list of multi-word units
concordance – examples in context
keywords – terminology extraction of one-word units
text type analysis – statistics of metadata in the corpus
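
To make two of the outputs listed above more concrete, here is a minimal Python sketch of an n-gram frequency list and a concordance (keyword in context). It is an illustration only and does not represent how Sketch Engine computes these results. The function names `ngram_frequencies` and `concordance`, the `window` parameter, and the sample tokens are all hypothetical, and the input is assumed to be already segmented into words.

```python
from collections import Counter


def ngram_frequencies(tokens, n):
    """Count every contiguous run of n tokens (hypothetical helper)."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams)


def concordance(tokens, keyword, window=2):
    """Return (left context, keyword, right context) for each hit (hypothetical helper)."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            hits.append((left, tok, right))
    return hits


# Assumed, already-segmented sample text; not taken from the corpus.
tokens = ["新华社", "北京", "电", "新华社", "北京", "消息"]

bigrams = ngram_frequencies(tokens, 2)
top_bigram = bigrams.most_common(1)[0]   # most frequent two-word unit
kwic = concordance(tokens, "北京", window=1)

print(top_bigram)
for left, key, right in kwic:
    print(" ".join(left), f"[{key}]", " ".join(right))
```

Word segmentation is the hard part for Chinese text and is deliberately left out here; the sketch only shows what the frequency list and the context lines look like once tokens exist.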

Bibliography

Citation

Graff, David, and Ke Chen. Chinese Gigaword LDC2003T09. Web Download. Philadelphia: Linguistic Data Consortium, 2003.

Bibliographical references about the corpus

Hong, J. F., & Huang, C. R. (2006, November). Using Chinese Gigaword Corpus and Chinese Word Sketch in linguistic research. In PACLIC.

Ma, W. Y., & Huang, C. R. (2006, May). Uniform and effective tagging of a heterogeneous giga-word corpus. In 5th International Conference on Language Resources and Evaluation (LREC2006) (pp. 24-28).

Chinese word sketches

Kilgarriff, A., Huang, C. R., Rychlý, P., Smith, S., & Tugwell, D. (2005). Chinese word sketches.
