| Afrikaans | Afrikaans Web 2024 (afTenTen24) | 141,774,668 |
| Albanian | Albanian Web 2020 (sqTenTen20) | 528,084,150 |
| Amharic | Amharic Web 2013-17 (amWaC17) | 25,975,846 |
| Arabic | Arabic Web 2024 (arTenTen24) | 6,572,150,262 |
| Armenian | Armenian Wikipedia corpus 2020 (hywiki20) | 51,349,694 |
| Assamese | Assamese Wikipedia 2023 (asWiki23) | 2,581,684 |
| Azerbaijani | Turkic web – Azerbaijani | 94,267,206 |
| Bashkir | Bashkir Drama Corpus | 18,723 |
| Basque | Basque Web (BasqueWaC v2) | 99,719,584 |
| Belarusian | Belarusian Web 2020 (beTenTen20) | 51,297,389 |
| Bengali | Bengali Web 2021 (bnTenTen21) | 470,732,738 |
| Bosnian | MaCoCu Bosnian Web v1 (2021-2022) | 715,708,157 |
| Breton | OpenSubtitles 2018 parallel – Breton | 85,503 |
| Bulgarian | Bulgarian Web 2021 (bgTenTen21) | 4,674,884,452 |
| Cantonese | Cantonese Web (CantoneseWaC) | 30,898,663 |
| Catalan | Catalan Web 2014 (caTenTen14) | 182,608,420 |
| Chinese Simplified | Chinese Web 2017 (zhTenTen17) Simplified | 13,531,331,169 |
| Chinese Traditional | Chinese Web 2017 (zhTenTen17) Traditional | 2,400,405,372 |
| Crimean Tatar | Crimean Tatar National Monolingual & Parallel Corpora, Crimean Tatar | 2,958,868 |
| Croatian | MaCoCu Croatian Web v2 (2021–2022) | 2,299,750,788 |
| Czech | Czech Web 2023 (csTenTen23) | 4,456,427,977 |
| Danish | Danish Web 2020 (daTenTen20) | 3,480,275,804 |
| Dutch | Dutch Web 2020 (nlTenTen20) | 5,890,009,964 |
| English | English Web 2021 (enTenTen21) | 52,268,286,493 |
| Estonian | Estonian Web 2023 (etTenTen23) | 1,508,458,913 |
| Filipino | Tagalog (Filipino) Web 2019 (tlTenTen19) | 198,303,250 |
| Finnish | Finnish Web 2024 (fiTenTen24) | 4,417,192,749 |
| French | French Web 2023 (frTenTen23) | 23,191,789,469 |
| Frisian | Western Frisian Web 2013 (FrisianWaC) | 3,116,119 |
| Georgian | Georgian Web 2013 (kaWaC) | 50,713,604 |
| German | German Web 2023 (deTenTen23) | 16,667,474,100 |
| Greek | Greek Web 2019 (elTenTen19) | 2,342,091,029 |
| Gujarati | Gujarati Web 2021 (guTenTen21) | 88,574,710 |
| Hausa (Boko) | Hausa Web 2015 (hausaWaC15) | 5,304,300 |
| Hebrew | Hebrew Web 2021 (heTenTen21) | 2,775,686,699 |
| Hindi | Hindi Web 2021 (hiTenTen21) | 792,395,313 |
| Hungarian | Hungarian Web 2023 (huTenTen23) | 3,494,350,960 |
| Icelandic | Icelandic Web 2020 (isTenTen20) | 518,620,759 |
| Igbo | Igbo Web 2015 (IgboWaC15) | 331,042 |
| Indonesian | Indonesian Web 2024 (idTenTen24) | 7,108,841,939 |
| Irish | Irish Web 2022 (gaTenTen22) | 125,040,541 |
| Italian | Italian Web 2020 (itTenTen20) | 12,451,734,885 |
| Japanese | Japanese Web 2011 sample (jaTenTen11, LUW) | 163,837,764 |
| Kannada | Kannada Web 2012 (knWaC12) | 11,056,526 |
| Kazakh | Turkic web – Kazakh | 139,417,763 |
| Khmer | Khmer Web 2018 (kmTenTen18) | 16,500,379 |
| Korean | Korean Web 2018 (koTenTen18) | 1,668,851,720 |
| Kyrgyz | Turkic web – Kyrgyz | 19,369,507 |
| Lao | Lao Web 2019 (loTenTen19) | 105,018,584 |
| Latin | LatinISE historical corpus v2.2 | 11,036,900 |
| Latvian | Latvian Web 2014 (lvTenTen14) | 530,367,474 |
| Lithuanian | Lithuanian Web 2021 (ltTenTen21) | 1,772,410,416 |
| Macedonian | MaCoCu Macedonian Web v2 (2021) | 512,171,886 |
| Malay | Malay Web 2024 (msTenTen24) | 805,094,746 |
| Malayalam | Malayalam Web (malayalamWaC) | 15,950,663 |
| Maldivian | Maldivian Web 2022 (dvTenTen22) | 20,880,246 |
| Maltese | Korpus Malti v2.0 | 110,714,844 |
| Maori | Maori Web 2013 and 2020 (miTenTen20) | 11,814,825 |
| Nepali | Nepali National Corpus | 13,440,835 |
| Norwegian | Norwegian Web 2023 (noTenTen23, Bokmål) | 2,471,455,518 |
| Norwegian Bokmål | Norwegian Web 2023 (noTenTen23, Bokmål) | 2,471,455,518 |
| Norwegian Nynorsk | Norwegian Web 2023 (nnTenTen23, Nynorsk) | 151,767,346 |
| Oromo | Oromo Web 2016 (orWaC16) | 4,249,953 |
| Persian | TalkBank Persian (blog posts) | 269,753,238 |
| Polish | Polish Web 2019 (plTenTen19) | 3,994,024,317 |
| Portuguese | Portuguese Web 2023 (ptTenTen23) | 16,976,742,883 |
| Punjabi (Gurmukhi) | Western Punjabi Web 2017 in Shahmukhi script (pnbTenTen17) | 2,806,904 |
| Romanian | Romanian Web 2021 (roTenTen21) | 2,763,173,824 |
| Russian | Russian Web 2020 (ruTenTen20) | 19,125,894,850 |
| Samoan | Samoan Web (SamoanWac1) | 3,115,385 |
| Scottish Gaelic | Scottish Gaelic Wiki 2015 (gdWiki) | 980,026 |
| Serbian | MaCoCu Serbian Web v1 (2021-2022) | 2,435,143,021 |
| Serbian (Latin) | Serbian Web (srWaC 1.2 processed by RFTagger v1) | 441,888,202 |
| Setswana | Setswana/Tswana Web (SetswanaWaC v2) | 11,496,687 |
| Sinhalese | OpenSubtitles 2018 parallel – Sinhalese | 3,430,727 |
| Slovak | Slovak Web 2023 (skTenTen23) | 898,031,101 |
| Slovenian | Slovenian Web 2015 (slTenTen15, TreeTagger v2) | 829,544,337 |
| Somali | Somali Web 2016 (soWaC16) | 71,871,585 |
| Spanish | Spanish Web 2023 (esTenTen23) | 28,652,392,686 |
| Swahili | Swahili Web 2014 (swWaC) | 17,882,483 |
| Swedish | Swedish Web 2020 (svTenTen20) | 2,366,298,161 |
| Tagalog | Tagalog (Filipino) Web 2019 (tlTenTen19) | 198,303,250 |
| Tajik | Tajik Web (TajikWaC) | 93,151,897 |
| Tamil | Tamil Web 2021 (taTenTen21) | 823,837,031 |
| Tatar | Tatar Mixed Corpus | 102,779,803 |
| Telugu | Telugu Web (TeluguWaC) | 3,691,203 |
| Thai | Thai Web 2018 (thTenTen18) | 640,530,227 |
| Tigrinya | Tigrinya Web 2016 (tiWaC16) | 2,087,613 |
| Turkish | Turkish Web 2020 (trTenTen20) | 4,980,168,485 |
| Turkmen | Turkic web – Turkmen | 2,105,359 |
| Ukrainian | Ukrainian Web 2022 (ukTenTen22) | 7,594,784,148 |
| Urdu | Urdu Web (UrduWaC) | 53,269,273 |
| Uzbek | Turkic web – Uzbek | 18,720,334 |
| Vietnamese | Vietnamese Web 2017 (viTenTen17) | 6,056,899,600 |
| Welsh | Welsh Web 2013 (WelshWaC) | 12,458,397 |
| Yoruba | Yoruba Web 2015 (YorubaWaC15) | 2,816,965 |