| Afrikaans | Afrikaans Web 2024 (afTenTen24) | 141,774,668 | 
| Albanian | Albanian Web 2020 (sqTenTen20) | 528,084,150 | 
| Amharic | Amharic Web 2013-17 (amWaC17) | 25,975,846 | 
| Arabic | Arabic Web 2024 (arTenTen24) | 6,572,150,262 | 
| Armenian | Armenian Wikipedia corpus 2020 (hywiki20) | 51,349,694 | 
| Assamese | Assamese Wikipedia 2023 (asWiki23) | 2,581,684 | 
| Azerbaijani | Turkic web – Azerbaijani | 94,267,206 | 
| Bashkir | Bashkir Drama Corpus | 18,723 | 
| Basque | Basque Web (BasqueWaC v2) | 99,719,584 | 
| Belarusian | Belarusian Web 2020 (beTenTen20) | 51,297,389 | 
| Bengali | Bengali Web 2021 (bnTenTen21) | 470,732,738 | 
| Bosnian | MaCoCu Bosnian Web v1 (2021–2022) | 715,708,157 | 
| Breton | OpenSubtitles 2018 parallel – Breton | 85,503 | 
| Bulgarian | Bulgarian Web 2021 (bgTenTen21) | 4,674,884,452 | 
| Cantonese | Cantonese Web (CantoneseWaC) | 30,898,663 | 
| Catalan | Catalan Web 2014 (caTenTen14) | 182,608,420 | 
| Chinese Simplified | Chinese Web 2017 (zhTenTen17) Simplified | 13,531,331,169 | 
| Chinese Traditional | Chinese Web 2017 (zhTenTen17) Traditional | 2,400,405,372 | 
| Crimean Tatar | Crimean Tatar National Monolingual & Parallel Corpora, Crimean Tatar | 2,958,868 | 
| Croatian | MaCoCu Croatian Web v2 (2021–2022) | 2,299,750,788 | 
| Czech | Czech Web 2023 (csTenTen23) | 4,456,427,977 | 
| Danish | Danish Web 2020 (daTenTen20) | 3,480,275,804 | 
| Dutch | Dutch Web 2020 (nlTenTen20) | 5,890,009,964 | 
| English | English Web 2021 (enTenTen21) | 52,268,286,493 | 
| Estonian | Estonian Web 2023 (etTenTen23) | 1,508,458,913 | 
| Filipino | Tagalog (Filipino) Web 2019 (tlTenTen19) | 198,303,250 | 
| Finnish | Finnish Web 2024 (fiTenTen24) | 4,417,192,749 | 
| French | French Web 2023 (frTenTen23) | 23,191,789,469 | 
| Frisian | Western Frisian Web 2013 (FrisianWaC) | 3,116,119 | 
| Georgian | Georgian Web 2013 (kaWaC) | 50,713,604 | 
| German | German Web 2023 (deTenTen23) | 16,667,474,100 | 
| Greek | Greek Web 2019 (elTenTen19) | 2,342,091,029 | 
| Gujarati | Gujarati Web 2021 (guTenTen21) | 88,574,710 | 
| Hausa (Boko) | Hausa Web 2015 (hausaWaC15) | 5,304,300 | 
| Hebrew | Hebrew Web 2021 (heTenTen21) | 2,775,686,699 | 
| Hindi | Hindi Web 2021 (hiTenTen21) | 792,395,313 | 
| Hungarian | Hungarian Web 2023 (huTenTen23) | 3,494,350,960 | 
| Icelandic | Icelandic Web 2020 (isTenTen20) | 518,620,759 | 
| Igbo | Igbo Web 2015 (IgboWaC15) | 331,042 | 
| Indonesian | Indonesian Web 2024 (idTenTen24) | 7,108,841,939 | 
| Irish | Irish Web 2022 (gaTenTen22) | 125,040,541 | 
| Italian | Italian Web 2020 (itTenTen20) | 12,451,734,885 | 
| Japanese | Japanese Web 2011 sample (jaTenTen11, LUW) | 163,837,764 | 
| Kannada | Kannada Web 2012 (knWaC12) | 11,056,526 | 
| Kazakh | Turkic web – Kazakh | 139,417,763 | 
| Khmer | Khmer Web 2018 (kmTenTen18) | 16,500,379 | 
| Korean | Korean Web 2018 (koTenTen18) | 1,668,851,720 | 
| Kyrgyz | Turkic web – Kyrgyz | 19,369,507 | 
| Lao | Lao Web 2019 (loTenTen19) | 105,018,584 | 
| Latin | LatinISE historical corpus v2.2 | 11,036,900 | 
| Latvian | Latvian Web 2014 (lvTenTen14) | 530,367,474 | 
| Lithuanian | Lithuanian Web 2021 (ltTenTen21) | 1,772,410,416 | 
| Macedonian | MaCoCu Macedonian Web v2 (2021) | 512,171,886 | 
| Malay | Malay Web 2024 (msTenTen24) | 805,094,746 | 
| Malayalam | Malayalam Web (malayalamWaC) | 15,950,663 | 
| Maldivian | Maldivian Web 2022 (dvTenTen22) | 20,880,246 | 
| Maltese | Korpus Malti v2.0 | 110,714,844 | 
| Maori | Maori Web 2013 and 2020 (miTenTen20) | 11,814,825 | 
| Nepali | Nepali National Corpus | 13,440,835 | 
| Norwegian | Norwegian Web 2023 (noTenTen23, Bokmål) | 2,471,455,518 | 
| Norwegian Bokmål | Norwegian Web 2023 (noTenTen23, Bokmål) | 2,471,455,518 | 
| Norwegian Nynorsk | Norwegian Web 2023 (nnTenTen23, Nynorsk) | 151,767,346 | 
| Oromo | Oromo Web 2016 (orWaC16) | 4,249,953 | 
| Persian | TalkBank Persian (blog posts) | 269,753,238 | 
| Polish | Polish Web 2019 (plTenTen19) | 3,994,024,317 | 
| Portuguese | Portuguese Web 2023 (ptTenTen23) | 16,976,742,883 | 
| Punjabi (Gurmukhi) | Western Punjabi Web 2017 in Shahmukhi script (pnbTenTen17) | 2,806,904 | 
| Romanian | Romanian Web 2021 (roTenTen21) | 2,763,173,824 | 
| Russian | Russian Web 2020 (ruTenTen20) | 19,125,894,850 | 
| Samoan | Samoan Web (SamoanWac1) | 3,115,385 | 
| Scottish Gaelic | Scottish Gaelic Wiki 2015 (gdWiki) | 980,026 | 
| Serbian | MaCoCu Serbian Web v1 (2021–2022) | 2,435,143,021 | 
| Serbian (Latin) | Serbian Web (srWaC 1.2 processed by RFTagger v1) | 441,888,202 | 
| Setswana | Setswana/Tswana Web (SetswanaWaC v2) | 11,496,687 | 
| Sinhalese | OpenSubtitles 2018 parallel – Sinhalese | 3,430,727 | 
| Slovak | Slovak Web 2023 (skTenTen23) | 898,031,101 | 
| Slovenian | Slovenian Web 2015 (slTenTen15, TreeTagger v2) | 829,544,337 | 
| Somali | Somali Web 2016 (soWaC16) | 71,871,585 | 
| Spanish | Spanish Web 2023 (esTenTen23) | 28,652,392,686 | 
| Swahili | Swahili Web 2014 (swWaC) | 17,882,483 | 
| Swedish | Swedish Web 2020 (svTenTen20) | 2,366,298,161 | 
| Tagalog | Tagalog (Filipino) Web 2019 (tlTenTen19) | 198,303,250 | 
| Tajik | Tajik Web (TajikWaC) | 93,151,897 | 
| Tamil | Tamil Web 2021 (taTenTen21) | 823,837,031 | 
| Tatar | Tatar Mixed Corpus | 102,779,803 | 
| Telugu | Telugu Web (TeluguWaC) | 3,691,203 | 
| Thai | Thai Web 2018 (thTenTen18) | 640,530,227 | 
| Tigrinya | Tigrinya Web 2016 (tiWaC16) | 2,087,613 | 
| Turkish | Turkish Web 2020 (trTenTen20) | 4,980,168,485 | 
| Turkmen | Turkic web – Turkmen | 2,105,359 | 
| Ukrainian | Ukrainian Web 2022 (ukTenTen22) | 7,594,784,148 | 
| Urdu | Urdu Web (UrduWaC) | 53,269,273 | 
| Uzbek | Turkic web – Uzbek | 18,720,334 | 
| Vietnamese | Vietnamese Web 2017 (viTenTen17) | 6,056,899,600 | 
| Welsh | Welsh Web 2013 (WelshWaC) | 12,458,397 | 
| Yoruba | Yoruba Web 2015 (YorubaWaC15) | 2,816,965 |