| Afrikaans | Afrikaans Wikipedia corpus 2018 (afwiki) | 14,466,792 | | |
| Albanian | OPUS2 Albanian | 46,304,346 | | |
| Amharic | Amharic Web 2013-17 (amWaC17) | 25,975,846 | | |
| Arabic | Arabic Web 2012 (arTenTen12, Stanford tagger) | 7,475,624,779 | | |
| Azerbaijani | Turkic web – Azerbaijani | 94,267,206 | | |
| Basque | Basque Web (BasqueWaC v2) | 99,719,584 | | |
| Belarusian | Belarusian Web 2016 (beTenTen16) | 63,327,264 | | |
| Bengali | Bengali Web (bnWaC) | 11,519,730 | | |
| Bosnian | Bosnian Web (bsWaC 1.2) | 248,478,730 | | |
| Bulgarian | Bulgarian Web 2012 (bgTenTen12, TreeTagger v2) | 705,156,683 | | |
| Cantonese | Cantonese Web (CantoneseWaC) | 30,898,663 | | |
| Catalan | Catalan Web 2014 (caTenTen14 v2) | 182,691,653 | | |
| Chinese Simplified | Chinese Web 2017 (zhTenTen17) Simplified | 13,531,331,169 | Chinese Simplified Web 2017 sample | 250,361,047 |
| Chinese Traditional | Chinese Web 2017 (zhTenTen17) Traditional | 2,400,405,372 | Chinese Traditional Web 2017 (zhTenTen17) sample | 239,882,651 |
| Croatian | Croatian Web (hrWaC 2.2, RFTagger) | 1,211,328,660 | Croatian Web (hrWaC 2.2, RFTagger) | 1,211,328,660 |
| Czech | Czech Web 2017 (csTenTen17) | 10,502,222,474 | Czech Web 2017 sample | 249,877,322 |
| Danish | Danish Web 2017 (daTenTen17) | 2,170,690,492 | Danish Web 2017 sample | 214,447,970 |
| Dutch | Dutch Web 2014 (nlTenTen14) | 2,253,777,579 | Dutch Web 2014 (nlTenTen14) | 2,253,777,579 |
| English | English Web 2015 (enTenTen15) | 13,190,556,334 | English Web 2015 (enTenTen15) | 13,190,556,334 |
| Estonian | Estonian National Corpus 2019 (Estonian NC 2019) | 1,500,284,681 | Estonian National Corpus 2019 (Estonian NC 2019) | 1,500,284,681 |
| Filipino | Filipino Web (FilipinoWaC) | 26,991,049 | | |
| Finnish | Finnish Web 2014 (fiTenTen14) | 1,404,083,812 | Finnish Web 2014 (fiTenTen14) | 1,404,083,812 |
| French | French Web 2017 (frTenTen17) | 5,752,261,039 | French Web 2017 sample | 404,555,405 |
| Frisian | Western Frisian Web 2013 (FrisianWaC) | 3,116,119 | | |
| Georgian | Georgian Web 2013 (kaWaC) | 50,713,604 | | |
| German | German Web 2013 (deTenTen13) | 16,526,335,416 | German Web 2013 sample | 193,838,751 |
| Greek | Greek Web 2014 (elTenTen14) | 1,671,692,845 | | |
| Gujarati | Gujarati Web (guWaC) | 17,960,095 | | |
| Hausa (Boko) | Hausa Web 2015 (hausaWaC15) | 5,304,300 | | |
| Hebrew | Hebrew Web 2014 (heTenTen14, no POS tagging) | 890,282,843 | | |
| Hindi | Hindi Web 2012 (HindiWaC v. 4) | 107,960,109 | | |
| Hungarian | Hungarian Web 2012 (huTenTen12) | 2,572,620,694 | | |
| Icelandic | Icelandic texts [sample] | 5,436,035 | | |
| Igbo | Igbo Web 2015 (IgboWaC15) | 331,042 | | |
| Indonesian | Indonesian Web (IndonesianWaC) | 90,120,046 | | |
| Irish | New Corpus for Ireland (NCI Irish) | 29,886,201 | | |
| Italian | Italian Web 2016 (itTenTen16) | 4,989,729,171 | Italian Web 2016 sample | 201,204,942 |
| Japanese | Japanese Web 2011 sample (jaTenTen11, LUW) | 163,837,671 | Japanese Web 2011 sample (jaTenTen11, LUW) | 163,837,671 |
| Kannada | Kannada Web 2012 (knWaC12) | 11,056,526 | | |
| Kazakh | Turkic web – Kazakh | 139,417,763 | | |
| Khmer | Khmer Web 2018 (kmTenTen18) | 16,500,379 | | |
| Korean | Korean Web 2018 (koTenTen18) | 1,668,851,720 | Korean 2018 term reference corpus (koTenTen18_term_ref) | 83,749,660 |
| Kyrgyz | Turkic web – Kyrgyz | 19,369,507 | | |
| Lao | Lao Web 2018 (loTenTen18) | 15,862,991 | | |
| Latin | LatinISE historical corpus v2.2 | 11,036,900 | | |
| Latvian | Latvian Web 2014 (lvTenTen14) | 530,367,474 | | |
| Lithuanian | Lithuanian Web 2014 (ltTenTen14) | 778,151,979 | | |
| Macedonian | OPUS2 Macedonian | 40,348,792 | | |
| Malay | Malaysian Web (MalaysianWaC) | 182,578,743 | | |
| Malayalam | Malayalam Web (malayalamWaC) | 15,950,663 | | |
| Maltese | Maltese MLRS Corpus | 110,714,844 | | |
| Maori | Maori Web (MaoriWaC) | 6,952,801 | Maori Web (MaoriWaC) | 6,952,801 |
| Nepali | Nepali National Corpus | 13,440,835 | | |
| Norwegian (Mixed) | Norwegian Web 2017 (noTenTen17, Bokmål) | 2,472,483,911 | Norwegian Web 2017 sample (Bokmål) | 58,955,519 |
| Norwegian Bokmål | Norwegian Web 2017 (noTenTen17, Bokmål) | 2,472,483,911 | Norwegian Web 2017 sample (Bokmål) | 58,955,519 |
| Norwegian Nynorsk | Norwegian Web 2017 (noTenTen17, Nynorsk) | 174,830,652 | Norwegian Web 2017 sample (Nynorsk) | 58,743,828 |
| Oromo | Oromo Web 2016 (orWaC16) | 4,249,953 | | |
| Persian | OPUS2 Persian | 4,425,133 | | |
| Polish | Polish Web 2012 (plTenTen12, RFTagger) | 7,715,835,214 | Polish Web 2012 sample | 191,648,244 |
| Portuguese | Portuguese Web 2011 (old word sketches) | 3,896,392,719 | Portuguese Web 2011 sample | 202,548,549 |
| Romanian | Romanian Web 2016 (roTenTen16) | 2,640,496,763 | | |
| Russian | Russian Web 2011 (ruTenTen11) | 14,553,856,113 | Russian Web 2011 sample (ruTenTen11) | 998,099,963 |
| Samoan | Samoan Web (SamoanWac1) | 3,115,385 | | |
| Scottish Gaelic | Scottish Gaelic Wiki 2015 (gdWiki) | 980,026 | | |
| Serbian | Serbian Web (srWaC 1.2 processed by Hunpos) | 477,724,164 | | |
| Serbian (Latin) | Serbian Web (srWaC 1.2 processed by RFTagger v1) | 441,888,202 | Serbian Web (srWaC 1.2 processed by RFTagger v1) | 441,888,202 |
| Setswana | Setswana/Tswana Web (SetswanaWaC v2) | 11,496,687 | | |
| Slovak | Slovak Web 2011 (skTenTen11) | 540,112,634 | Slovak Web 2011 sample | 189,609,195 |
| Slovenian | Slovenian Web 2015 (slTenTen15, TreeTagger v2) | 829,544,337 | Slovenian Web 2015 sample | 195,792,821 |
| Somali | Somali Web 2016 (soWaC16) | 71,871,585 | | |
| Spanish | Spanish Web 2018 (esTenTen18) | 17,553,075,259 | Spanish Web 2018 sample | 177,257,648 |
| Swahili | Swahili Web 2014 (SwahiliWaC) | 17,882,483 | | |
| Swedish | Swedish Web 2014 (svTenTen14) | 3,401,035,817 | Swedish Web 2014 sample | 45,477,881 |
| Tagalog | Tagalog (Filipino) Web 2018 (tlTenTen18) | 151,164,040 | | |
| Tajik | Tajik Web (TajikWaC) | 93,151,897 | | |
| Tamil | Tamil Web 2015 (TamilWaC) | 26,750,515 | | |
| Tatar | Tatar Mixed Corpus | 102,779,803 | | |
| Telugu | Telugu Web (TeluguWaC) | 3,691,203 | | |
| Thai | Thai Web (ThaiWaC) | 82,787,119 | | |
| Tibetan | Tibetan Corpus 2 | 80,613,567 | | |
| Tigrinya | Tigrinya Web 2016 (tiWaC16) | 2,087,613 | | |
| Turkish | Turkish Web 2012 (trTenTen12) | 3,388,418,900 | | |
| Turkmen | Turkic web – Turkmen | 2,105,359 | | |
| Ukrainian | Ukrainian Web 2014 (ukTenTen14) | 2,194,447,594 | | |
| Urdu | Urdu Web (UrduWaC) | 53,269,273 | | |
| Uzbek | Turkic web – Uzbek | 18,720,334 | | |
| Vietnamese | Vietnamese Web (VietnameseWaC) | 106,464,835 | | |
| Welsh | Welsh Web 2013 (WelshWaC) | 12,458,397 | | |
| Yoruba | Yoruba Web 2015 (YorubaWaC15) | 2,816,965 | | |