The Web as Corpus (WaC) corpora were built with the Corpus Factory method, which is described in full in the paper listed in the bibliography below. The corpora are listed alphabetically by language:

A

Amharic (AmWaC web corpus), Arabic (arWaC web corpus)

B

Basque (euWaC), Bengali (bnWaC), Bosnian (bsWaC)

C

Cantonese (yueWaC), Chinese (ChineseTaiwanWaC), Croatian (hrWaC)

D

Danish (dkWaC), Dutch (Dutch web corpus)

E

English (pukWaC, ukWaC – British English corpus, ukWaCsst)

F

Filipino (filWaC), Frisian (fyWaC), French (frWaC)

G

Georgian (kaWaC), German (deWaC, Parsed deWaC (sdeWaC)), Greek (gkWaC), Gujarati (guWaC)

H

Hausa (haWaC), Hebrew (hebWaC), Hindi (hindiWaC)

I

Igbo (igWaC), Indonesian (idWaC), Italian (itWaC)

J

Japanese (jpWaC)

K

Kannada (knWaC)

L

Latvian (lvWaC – Latvian web corpus), Lithuanian (ltWaC – Lithuanian web corpus)

M

Malaysian (zsmWaC – Malaysian web corpus), Malayalam (mlWaC web corpus), Maltese (mtWaC – Maltese WaC corpus), Maori (miWaC – Maori web corpus), Mongolian (mnWaC – Mongolian web corpus)

N

Nepali (neWaC – Nepali web corpus)

O

Oromo (orWaC – Oromo web corpus)

P

Polish (plWaC – Polish Web corpus)

R

Russian (ruWac – Russian Web Corpus)

S

Samoan (smWaC – Samoan web corpus), Serbian (srWaC – Serbian Web corpus), Setswana (tnWaC – Setswana web corpus), Slovenian (slWaC2.1), Somali (soWaC – Somali web corpus), Spanish (esWaC – Spanish web corpus), Swahili (swWaC), Swedish (svWaC)

T

Tamil (tamilWaC), Tatar (Tatar Sample), Telugu (teluguWaC), Thai (thaiWaC), Tigrinya (tiWaC), Turkish (trWaC – Turkish Web Corpus)

U

Urdu (urWaC web corpus)

V

Vietnamese (viWaC)

W

Welsh (welshWaC)

Y

Yoruba (yoWaC web corpus)


Bibliography

Adam Kilgarriff, Siva Reddy, Jan Pomikálek and Avinesh PVS (2010). A corpus factory for many languages. In LREC workshop on Web Services and Processing Pipelines, Malta, May 2010.

Search WaC corpora in Sketch Engine

Sketch Engine offers a range of tools to work with web corpora.

Use Sketch Engine in minutes

Generate collocations, frequency lists, examples in contexts, n-grams or extract terms. Use our Quick Start Guide to learn it in minutes.

Other text corpora

Sketch Engine offers 800+ language corpora.