hebWaC – Hebrew corpus from the web

hebWaC: Hebrew web corpus

The Hebrew web corpus (hebWaC) is a Hebrew corpus made up of texts collected from the Internet. It is a domain-independent web corpus consisting of newspaper pages, blog posts, commercial websites, etc. The final size of the corpus is 47 million words.

The corpus was prepared according to the method described in A Corpus Factory for Many Languages (Kilgarriff et al., LREC 2010).

Part-of-speech tagset

The hebWaC corpus is part-of-speech tagged; the tags it uses are described in the Hebrew POS tagset summary.

Tools to work with the Hebrew corpus

A complete set of tools is available for working with this Hebrew corpus and generating:

word sketch – Hebrew collocations categorized by grammatical relations
thesaurus – synonyms and similar words for every word
word lists – lists of Hebrew nouns, verbs, adjectives etc. organized by frequency
n-grams – frequency lists of multi-word units
concordance – examples in context
keywords – terminology extraction of one-word units
text type analysis – statistics of metadata in the corpus
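
To make three of the outputs listed above more concrete (word lists, n-grams and concordance), here is a minimal Python sketch. It is an illustration only and not Sketch Engine's implementation: the function names `ngram_frequencies` and `concordance` are hypothetical, and the sample is a toy whitespace-tokenized English sentence rather than text from hebWaC.

```python
from collections import Counter


def ngram_frequencies(tokens, n):
    """Count every contiguous run of n tokens (hypothetical helper)."""
    # zip over n shifted copies of the token list yields all n-grams as tuples.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams)


def concordance(tokens, keyword, width=2):
    """Return (left context, keyword, right context) for each hit (hypothetical helper)."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append((left, tok, right))
    return lines


# Toy sample, not taken from hebWaC.
tokens = "the cat sat on the mat and the cat slept".split()

word_list = Counter(tokens).most_common()  # word list ordered by frequency
bigrams = ngram_frequencies(tokens, 2)     # frequency list of two-word units
hits = concordance(tokens, "cat")          # keyword shown with its context

print(word_list[:3])
print(bigrams.most_common(3))
for left, kw, right in hits:
    print(" ".join(left), f"[{kw}]", " ".join(right))
```

A real corpus tool would work on lemmatized and POS-tagged tokens rather than raw whitespace-split strings, which is what makes outputs such as lists restricted to nouns or verbs possible.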

Changelog

2017

computed word sketches

July 2013

initial version without word sketches

Bibliography

Corpus factory method

Kilgarriff, A., Reddy, S., Pomikálek, J., & Avinesh, P. V. S. (2010, May). A corpus factory for many languages. In LREC.

Search the Hebrew corpus

Sketch Engine offers a range of tools to work with this Hebrew corpus from the web.

