MalaysianWaC: Malaysian corpus from the web
The Malaysian web corpus (MalaysianWaC) is a corpus of Malaysian texts collected from the Internet. It was prepared following the method described in A Corpus Factory for Many Languages (Kilgarriff et al., LREC 2010). Malaysian is the standardized variety of the Malay language used in Malaysia.
The data was crawled by the Heritrix web crawler in 2010. The corpus comprises 230 million words.
The corpus is PoS-tagged with the Apertium tool, using a Malaysian tagset.
Tools to work with the Malaysian corpus
A complete set of Sketch Engine tools is available to work with this Malaysian corpus to generate:
- word sketch – Malaysian collocations categorized by grammatical relations
- thesaurus – synonyms and similar words for every word
- keywords – terminology extraction of one-word units
- word lists – lists of Malaysian nouns, verbs, adjectives etc. organized by frequency
- n-grams – frequency list of multi-word units
- concordance – examples in context
- text type analysis – statistics of metadata in the corpus
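To make three of the outputs listed above more concrete, here is a minimal Python sketch of a word list, an n-gram frequency list, and a keyword-in-context concordance. It is not Sketch Engine's implementation or API: the function names, the whitespace tokenizer, and the short Malay sample sentence are all assumptions made for illustration.

```python
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer (an assumption; real corpora use proper tokenization).
    return text.lower().split()


def word_list(tokens):
    # Word list: tokens ordered by descending frequency.
    return Counter(tokens).most_common()


def ngrams(tokens, n):
    # Frequency count of every contiguous sequence of n tokens.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def concordance(tokens, keyword, width=2):
    # Keyword in context: up to `width` tokens on each side of every hit.
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Hypothetical sample text, not taken from MalaysianWaC.
sample = "saya suka makan nasi dan saya suka minum teh"
tokens = tokenize(sample)

print(word_list(tokens))
print(ngrams(tokens, 2).most_common(3))
for left, kw, right in concordance(tokens, "suka"):
    print(f"{left:>10} [{kw}] {right}")
```

On a corpus of this size, the same ideas would need proper tokenization, the PoS tags, and an index rather than in-memory lists.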
version 1 (21st April 2017)
- created word sketches
- added attribute “sera”
initial version (5th April 2017)
- size 17 million words
Corpus factory method
Kilgarriff, A., Reddy, S., Pomikálek, J., & Avinesh, P. V. S. (2010, May). A corpus factory for many languages. In LREC.