kaWaC: Georgian Corpus from the Web
The Georgian Web Corpus (kaWaC) is a corpus of Georgian texts collected from the Internet. It was prepared following the approach described in the paper "A Corpus Factory for Many Languages" (Kilgarriff et al., LREC 2010).
The corpus was created during Lexicom 2013 and contains more than 50 million words. The texts have been cleaned and deduplicated but are not yet part-of-speech tagged.