Chinese Gigaword: Corpus of the Mainland and Traditional Chinese
The Chinese Gigaword Corpus is a corpus of Chinese journalism. It contains data from news agency archives and was prepared by the Linguistic Data Consortium (LDC), with source data covering the period 1990–2002. Chinese Gigaword comprises almost 600 million words belonging to two separate corpora:
Chinese GigaWord 2 Corpus: Mainland, simplified characters
- source data is journalism from the Xinhua News Agency, Beijing, from 1991 to 2002
- size more than 200 million words
Chinese GigaWord 2 Corpus: Taiwan, traditional characters
- source data is journalism from the Central News Agency, Taiwan, from 1990 to 2002
- size more than 380 million words
More information can be found at https://catalog.ldc.upenn.edu/LDC2003T09
The Chinese Gigaword corpus is POS-tagged using a Chinese part-of-speech tagset.
Tools to work with the Chinese Gigaword corpus
A complete set of Sketch Engine tools is available to work with both the Mainland (simplified) and Taiwan (traditional) Chinese corpora and to generate:
- word sketch – Chinese collocations categorized by grammatical relations
- thesaurus – synonyms and similar words for every word
- word lists – lists of Chinese nouns, verbs, adjectives etc. organized by frequency
- n-grams – frequency list of multi-word units
- concordance – examples in context
- keywords – terminology extraction of one-word units
- text type analysis – statistics of metadata in the corpus
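To make three of the outputs listed above more concrete, here is a minimal Python sketch of a frequency word list, an n-gram count and a keyword-in-context concordance. It is an illustration only, not how Sketch Engine computes them: the function names `ngram_counts` and `concordance` and the tiny sample are hypothetical, and the text is assumed to be already segmented into words, as the corpus itself is.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous sequence of n tokens in the list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def concordance(tokens, keyword, window=2):
    """Return (left context, keyword, right context) for each occurrence."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            hits.append((left, tok, right))
    return hits


# Hypothetical pre-segmented sample, not taken from the corpus.
tokens = ["我们", "学习", "中文", "他们", "学习", "中文", "我们", "喜欢", "中文"]

word_list = Counter(tokens).most_common()        # words ordered by frequency
bigrams = ngram_counts(tokens, 2)                # frequency of two-word units
hits = concordance(tokens, "学习", window=1)      # each hit with one word of context

for left, kw, right in hits:
    print(" ".join(left), f"[{kw}]", " ".join(right))
```

On a corpus of this size the same counts would be served from a pre-built index rather than computed by scanning a token list as this sketch does.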
Graff, David, and Ke Chen. Chinese Gigaword LDC2003T09. Web Download. Philadelphia: Linguistic Data Consortium, 2003.
Bibliographical references about the corpus
Hong, J. F., & Huang, C. R. (2006, November). Using Chinese Gigaword Corpus and Chinese Word Sketch in linguistic Research. In PACLIC.
Ma, W. Y., & Huang, C. R. (2006, May). Uniform and effective tagging of a heterogeneous giga-word corpus. In 5th International Conference on Language Resources and Evaluation (LREC2006) (pp. 24-28).
Chinese word sketches
Kilgarriff, A., Huang, C. R., Rychlý, P., Smith, S., & Tugwell, D. (2005). Chinese word sketches.