hebWaC: Hebrew web corpus
The Hebrew web corpus (hebWaC) is made up of Hebrew texts collected from the Internet. It is a domain-independent web corpus consisting of newspaper pages, blog posts, commercial websites, etc. The final size of the corpus is 47 million words.
The corpus was prepared according to the standards described in the paper A Corpus Factory for Many Languages (Kilgarriff et al., LREC 2010).
The hebWaC corpus is part-of-speech tagged using the tagset described in the Hebrew POS tagset summary.