KannadaWaC: Corpus of the Kannada Web
The Kannada Web corpus (KannadaWaC) is a language corpus made up of texts collected from the Internet. The corpus was prepared according to the standards described in the paper A Corpus Factory for Many Languages (Kilgarriff et al., LREC 2010).
The data was crawled in 2012 using the SpiderLing web spider and the WebBootCat tool. The final corpus contains 11 million words.
The corpus texts were POS-tagged with version 2 of the tagger developed by Siva Reddy and Serge Sharoff. See the POS tagset legend.