TaiwanWaC: Chinese Corpus from the Web
The Taiwan Chinese Web Corpus (TaiwanWaC) is a Chinese corpus made up of texts collected from the Internet. The corpus was prepared according to the standards described in the paper "A Corpus Factory for Many Languages" (Kilgarriff et al., LREC 2010). The corpus consists of 260 million words in total.
The TaiwanWaC corpus was PoS-tagged with the Stanford TreeTagger tool using the Chinese Penn Treebank tagset; see the tagset legend.