CantoneseWaC: Cantonese corpus from the Web
The Cantonese Web Corpus (CantoneseWaC) is a Chinese corpus made up of texts collected from the Internet. Only Cantonese seed words were used to crawl the texts. The corpus was prepared according to the standards described in A Corpus Factory for Many Languages (Kilgarriff et al., LREC 2010).
The CantoneseWaC corpus has not been part-of-speech (PoS) tagged yet.