The Web as Corpus (WaC) corpora were prepared using the Corpus Factory method, which is described in full in the paper listed in the bibliography below. List of corpora, grouped alphabetically by language:

A

Arabic (Arabic web corpus), Amharic (AmWaC web corpus)

B

Basque (basque_WaC), Bengali (bengaliWaC), Bosnian (bosnianWaC14)

C

Cantonese (Cantonese WaC), Chinese (ChineseTaiwanWaC), Croatian (hrWaC)

D

Danish (danishWaC), Dutch (Dutch web corpus)

E

English (pukWaC, ukWaC – British English corpus, ukWaCsst)

F

Filipino (filipinoWaC), Finnish (finnishWaC), Frisian (frisianWaC), French (frWaC)

G

Georgian (georgianWaC), German (deWaC, Parsed deWaC (sdeWaC)), Greek (gkWaC), Gujarati (gujarathiWaC)

H

Hausa (haWaC web corpus), Hebrew (hebWaC), Hindi (hindiWaC)

I

Igbo (igWaC), Indonesian (indonesianWaC), Italian (itWaC)

J

Japanese (jpWaC)

K

Kannada (Kannada WaC), Korean (koreanWaC)

L

Latin (latinWaC), Latvian (lvWaC – Latvian Web corpus), Lithuanian (lithuanianWaC – Lithuanian web corpus)

M

Malaysian (zsmWaC – Malaysian web corpus), Malayalam (mlWaC web corpus), Maltese (mtWaC – Maltese WaC corpus), Maori (miWaC – Maori web corpus), Mongolian (mnWaC – Mongolian web corpus)

N

Nepali (neWaC – Nepali web corpus), Norwegian (noWaC)

O

Oromo (orWaC – Oromo web corpus)

P

Polish (plWaC – Polish Web corpus)

R

Romanian (roWaC), Russian (Russian Web Corpus)

S

Samoan (smWaC – Samoan web corpus), Serbian (srWaC – Serbian Web corpus), Setswana (tnWaC – Setswana web corpus), Somali (soWaC – Somali web corpus), Spanish (Spanish web corpus), Swahili (swahiliWaC), Swedish (swedishWaC)

T

Tamil (tamilWaC), Tatar (Tatar Sample), Telugu (teluguWaC), Thai (thaiWaC), Tigrinya (tiWaC), Turkish (trWaC – Turkish Web Corpus)

U

Urdu (urWaC web corpus)

V

Vietnamese (viWaC)

W

Welsh (welshWaC)

Y

Yoruba (yoWaC web corpus)


Bibliography

Adam Kilgarriff, Siva Reddy, Jan Pomikálek and Avinesh PVS (2010). A corpus factory for many languages. In LREC workshop on Web Services and Processing Pipelines, Malta, May 2010.

Search WaC corpora in Sketch Engine

Sketch Engine offers a range of tools to work with web corpora.

Use Sketch Engine in minutes

Generate collocations, frequency lists, examples in contexts, n-grams or extract terms. Use our Quick Start Guide to learn it in minutes.

Other text corpora

Sketch Engine offers 400+ language corpora.