The Web as Corpus (WaC) corpora were prepared using the Corpus Factory method, which is described in full in the paper listed in the Bibliography below. List of corpora, grouped by the first letter of the language name:
A
Arabic (Arabic web corpus), Amharic (AmWaC web corpus)
B
Basque (basque_WaC), Bengali (bengaliWaC), Bosnian (bosnianWaC14)
C
Cantonese (Cantonese WaC), Chinese (ChineseTaiwanWaC), Croatian (hrWaC)
D
Danish (danishWaC), Dutch (Dutch web corpus)
E
English (pukWaC, ukWaC – British English corpus, ukWaCsst)
F
Filipino (filipinoWaC), Finnish (finnishWaC), Frisian (frisianWaC), French (frWaC)
G
Georgian (georgianWaC), German (deWaC, Parsed deWaC (sdeWaC)), Greek (gkWaC), Gujarati (gujarathiWaC)
H
Hausa (haWaC web corpus), Hebrew (hebWaC), Hindi (hindiWaC)
I
Igbo (igWaC), Indonesian (indonesianWaC), Italian (itWaC)
J
Japanese (jpWaC)
K
Kannada (Kannada WaC), Korean (koreanWaC)
L
Latin (latinWaC), Latvian (lvWaC – Latvian Web corpus), Lithuanian (lithuanianWaC – Lithuanian web corpus)
M
Malaysian (zsmWaC – Malaysian web corpus), Malayalam (mlWaC web corpus), Maltese (mtWaC – Maltese WaC corpus), Maori (miWaC – Maori web corpus), Mongolian (mnWaC – Mongolian web corpus)
N
Nepali (neWaC – Nepali web corpus), Norwegian (noWaC)
O
Oromo (orWaC – Oromo web corpus)
P
Polish (plWaC – Polish Web corpus)
R
Romanian (roWaC), Russian (Russian Web Corpus)
S
Samoan (smWaC – Samoan web corpus), Serbian (srWaC – Serbian Web corpus), Setswana (tnWaC – Setswana web corpus), Somali (soWaC – Somali web corpus), Spanish (Spanish web corpus), Swahili (swahiliWaC), Swedish (swedishWaC)
T
Tamil (tamilWaC), Tatar (Tatar Sample), Telugu (teluguWaC), Thai (thaiWaC), Tigrinya (tiWaC), Turkish (trWaC – Turkish Web Corpus)
U
Urdu (urWaC web corpus)
V
Vietnamese (viWaC)
W
Welsh (welshWaC)
Y
Yoruba (yoWaC web corpus)
Bibliography
Adam Kilgarriff, Siva Reddy, Jan Pomikálek and Avinesh PVS (2010). A corpus factory for many languages. In LREC workshop on Web Services and Processing Pipelines, Malta, May 2010.
Search WaC corpora in Sketch Engine
Sketch Engine offers a range of tools to work with web corpora.
Use Sketch Engine in minutes
Generate collocations, frequency lists, examples in context, and n-grams, or extract terms. Use our Quick Start Guide to learn it in minutes.