MalaysianWaC: Malaysian corpus from the web
The Malaysian web corpus (MalaysianWaC) is a Malaysian corpus made up of texts collected from the Internet. The corpus was prepared following the methodology described in the paper A Corpus Factory for Many Languages (Kilgarriff et al., LREC 2010). Malaysian is the standard variety of the Malay language used in Malaysia.
The data was crawled with the Heritrix web crawler in 2010. The corpus comprises 230 million words.
The corpus is part-of-speech (PoS) tagged with the Apertium tool, using the following Malaysian tagset.