mlWaC: Malayalam corpus from the web
The Malayalam Web Corpus (mlWaC) is a Malayalam corpus made up of texts collected from the Internet. The corpus was prepared according to the method described in the paper "A Corpus Factory for Many Languages" (Kilgarriff et al., LREC 2010).
The data were downloaded in fall 2012 and total 16 million words. The texts were cleaned and deduplicated.