The Web as Corpus (WaC) corpora were built with the Corpus Factory method, which is described in full in the paper cited below. List of corpora (ordered by language):
Kannada (Kannada WaC), Korean (koreanWaC)
Nepali (neWaC – Nepali web corpus), Norwegian (noWaC)
Oromo (orWaC – Oromo web corpus)
Polish (plWaC – Polish Web corpus)
Samoan (smWaC – Samoan web corpus), Serbian (srWaC – Serbian Web corpus), Setswana (tnWaC – Setswana web corpus), Somali (soWaC – Somali web corpus), Spanish (Spanish web corpus), Swahili (swahiliWaC), Swedish (swedishWaC)
Urdu (urWaC web corpus)
Yoruba (yoWaC web corpus)
Adam Kilgarriff, Siva Reddy, Jan Pomikálek and Avinesh PVS (2010). A corpus factory for many languages. In LREC workshop on Web Services and Processing Pipelines, Malta, May 2010.
Search WaC corpora in Sketch Engine
Sketch Engine offers a range of tools to work with web corpora.
Use Sketch Engine in minutes
Generate collocations, frequency lists, examples in context, and n-grams, or extract terms. Use our Quick Start Guide to learn the basics in minutes.