roWaC: Romanian web corpus

The Romanian web corpus (roWaC) is a Romanian corpus made up of texts collected from the Internet. It was gathered by Monica Macoveiciuc (Alexandru Ioan Cuza University, Iasi) using two methods, based on WebBootCat and Heritrix, and has a total size of 44 million words. The collected text was then processed to remove unwanted content. The word sketches were also prepared by Monica Macoveiciuc.

Part-of-speech tagset

The roWaC corpus was lemmatized and tagged with TTL (Tokenizing, Tagging and Lemmatizing free running texts), developed by RACAI, the Research Institute for Artificial Intelligence of the Romanian Academy. See the Romanian PoS tagset summary.

Tools to work with the Romanian corpus

A complete set of Sketch Engine tools is available for working with this Romanian web corpus. They can generate:

  • word sketch – Romanian collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • word lists – lists of Romanian nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
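
As a rough illustration of what three of the outputs listed above contain (word lists, n-grams and concordances), here is a minimal Python sketch. It is not how Sketch Engine computes them: the whitespace tokenizer, the function names and the one-line sample text are all assumptions made for this example.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for a real tokenizer)."""
    return text.lower().split()


def word_list(tokens):
    """Frequency list of tokens, most frequent first."""
    return Counter(tokens).most_common()


def ngrams(tokens, n):
    """Frequency counts of contiguous sequences of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def concordance(tokens, keyword, width=2):
    """Keyword-in-context lines as (left context, keyword, right context)."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Invented sample sentence, not taken from roWaC.
sample = "ana are mere și ana are pere"
tokens = tokenize(sample)

for left, kw, right in concordance(tokens, "are"):
    print(f"{left:>10} [{kw}] {right}")
```

On a real corpus the same ideas apply to tens of millions of tokens, which is why dedicated indexing tools are used instead of in-memory counters like these.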

Changelog

version with lempos (April 2018)

  • lempos attribute created and used for word sketches

initial version (August 2009)
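
The lempos attribute mentioned in the changelog combines a lemma with a short part-of-speech marker, conventionally written as the lemma, a hyphen and a one-letter suffix (for example, a noun lemma followed by -n). The sketch below shows how such a value could be derived. The tag-to-suffix mapping, the example tags and the function name are hypothetical; the actual mapping used for roWaC is not given here.

```python
# Hypothetical mapping from the first letter of a PoS tag to a lempos suffix.
# The real mapping for the roWaC tagset may differ.
POS_SUFFIX = {"N": "n", "V": "v", "A": "j", "R": "a"}


def make_lempos(lemma, tag, default="x"):
    """Join a lemma and a one-letter PoS suffix, e.g. ("casă", "Ncfsrn") -> "casă-n"."""
    suffix = POS_SUFFIX.get(tag[:1].upper(), default)
    return f"{lemma}-{suffix}"


print(make_lempos("casă", "Ncfsrn"))
print(make_lempos("merge", "Vmip3s"))
```

Keeping the part of speech in the attribute lets a query distinguish homographs that share a lemma but differ in word class, which is useful when word sketches are computed per part of speech.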

Bibliography

Amharic web corpus

Rychlý, P., & Suchomel, V. (2016, September). Annotated Amharic Corpora. In International Conference on Text, Speech, and Dialogue (pp. 295-302). Springer International Publishing.

Corpus factory method

Kilgarriff, A., Reddy, S., Pomikálek, J., & Avinesh, P. V. S. (2010, May). A corpus factory for many languages. In LREC.
