Vietnamese corpus (viWaC)

viWaC: Vietnamese corpus from the web

The Vietnamese web corpus (viWaC) is a Vietnamese corpus made up of texts collected from the Internet. The corpus was prepared following the method described in A Corpus Factory for Many Languages (Kilgarriff et al., LREC 2010).

The corpus consists of 100 million words and is lemmatized and part-of-speech tagged.

Part-of-speech tagset

See the Vietnamese part-of-speech tagset describing POS tags used in this Vietnamese corpus.

Tools to work with the Vietnamese corpus

A complete set of Sketch Engine tools is available to work with this Vietnamese web corpus to generate:

word sketch – Vietnamese collocations categorized by grammatical relations
thesaurus – synonyms and similar words for every word
word lists – lists of Vietnamese nouns, verbs, adjectives etc. organized by frequency
n-grams – frequency list of multi-word units
concordance – examples in context
keywords – terminology extraction of one-word units
text type analysis – statistics of metadata in the corpus
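
To make two of the tools above more concrete, here is a minimal Python sketch of an n-gram frequency list and a keyword-in-context concordance. It is not Sketch Engine's implementation or API. The function names and the toy sentence are hypothetical, and the sample assumes the text is already segmented into words, with underscores joining the syllables of one word.

```python
from collections import Counter


def ngram_frequencies(tokens, n):
    """Count every contiguous run of n tokens."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


def concordance(tokens, keyword, width=2):
    """Return (left context, keyword, right context) for each hit."""
    hits = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits


# Hypothetical toy sample, not taken from viWaC.
sample = "tôi học tiếng_Việt và tôi học tiếng_Anh".split()

bigrams = ngram_frequencies(sample, 2)
print(bigrams.most_common(3))                      # most frequent bigrams first
print(concordance(sample, sample[1], width=1))     # every hit with one word of context
```

A real corpus tool would work on an index rather than a token list in memory, and would search by lemma or part-of-speech tag as well as by word form.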

Changelog

version 2 (2012)

created word sketches

version 1 (2010)

initial version

Bibliography

BARONI, Marco, et al. The WaCky wide web: a collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation, 2009, 43.3: 209-226.

Corpus factory method

KILGARRIFF, Adam; REDDY, Siva; POMIKÁLEK, Jan; PVS, Avinesh. A corpus factory for many languages. In: LREC workshop on Web Services and Processing Pipelines. Malta, May 2010.

Vietnamese word sketches

KILGARRIFF, Adam; LE-HONG, Phuong. Vietnamese Word Sketches. In: Proceedings of the First International Workshop on Vietnamese Language and Speech Processing. p. 1-4.

Search the Vietnamese corpus

Sketch Engine offers a range of tools to work with this Vietnamese corpus.

Other text corpora

Sketch Engine offers 800+ language corpora.

Use Sketch Engine in minutes

Generate collocations, frequency lists, examples in contexts, n-grams or extract terms. Use our Quick Start Guide to learn it in minutes.
