ACL Anthology Reference Corpus (ARC) |
English |
open |
62,196,334 |
Afrikaans Web 2024 (afTenTen24) |
Afrikaans |
trial |
142,303,550 |
Afrikaans Wikipedia 2022 |
Afrikaans |
trial |
22,227,137 |
Afrikaans Wikipedia corpus 2018 (afwiki) |
Afrikaans |
main |
14,466,792 |
Albanian Web 2020 (sqTenTen20) |
Albanian |
trial |
528,084,150 |
Alsatian Drama Corpus |
German |
main |
276,204 |
American Spanish Web 2011 (esamTenTen11) |
Spanish |
main |
7,475,579,365 |
Amharic Web 2013-17 (amWaC17) |
Amharic |
trial |
25,975,846 |
ArabCC – Learner Corpus of English Essays |
English |
main |
202,364 |
Arabic Learner Corpus (ALC) |
Arabic |
main |
362,712 |
Arabic Trends (2014–today) |
Arabic |
trial |
6,309,598,423 |
Arabic Web 2009 |
Arabic |
main |
150,282,522 |
Arabic Web 2012 (arTenTen12) |
Arabic |
main |
7,475,624,779 |
Arabic Web 2012 sample 115M (arTenTen12, Mada tagger) |
Arabic |
main |
115,315,274 |
Arabic Web 2024 (arTenTen24) |
Arabic |
trial |
6,572,150,262 |
Araneum Anglicum Africanum Maius [2015] |
English |
main |
854,484,093 |
Araneum Anglicum Asiaticum Maius [2015] |
English |
main |
867,259,037 |
Araneum Anglicum Maius [2015] |
English |
trial |
888,466,066 |
Araneum Finnicum Maius [2014] |
Finnish |
main |
817,453,523 |
Araneum Francogallicum Maius [2015] |
French |
main |
933,688,995 |
Araneum Germanicum Maius [2013] |
German |
main |
875,465,845 |
Araneum Hispanicum Maius [2013] |
Spanish |
main |
892,299,770 |
Araneum Hungaricum Maius [2014] |
Hungarian |
trial |
792,549,686 |
Araneum Italicum Maius (Italian, 14.12) 1,20 G |
Italian |
main |
890,568,531 |
Araneum Nederlandicum Maius [2013] |
Dutch |
main |
713,417,518 |
Araneum Polonicum Maius [2013] |
Polish |
main |
595,768,667 |
Araneum Portugallicum Maius [2015] |
Portuguese |
main |
862,134,902 |
Araneum Russicum Russicum Maius (Russia-only Russian, 15.03) 1,20 G |
Russian |
trial |
859,319,823 |
Araneum Slovacum Maius [2013] |
Slovak |
trial |
816,125,010 |
Armenian Wikipedia corpus 2020 (hywiki20) |
Armenian |
trial |
51,349,694 |
Assamese Wikipedia 2023 (asWiki23) |
Assamese |
trial |
2,581,684 |
Australian Legislative Corpus 2023 |
English |
ondemand |
138,411,932 |
Bashkir Drama Corpus |
Bashkir |
main |
18,723 |
Basque Web (BasqueWaC v2) |
Basque |
trial |
99,719,584 |
Belarusian Web 2016 (beTenTen16) |
Belarusian |
trial |
63,327,264 |
Belgian parliamentary debates (ParlaMint 2.1) |
French |
trial |
30,865,918 |
Belgian parliamentary debates (ParlaMint 2.1, CoNLL format) |
French |
trial |
30,864,767 |
Bengali Web (bnWaC) |
Bengali |
main |
11,519,730 |
Bengali Web 2017 (bnTenTen17) |
Bengali |
main |
812,606,941 |
Bengali Web 2021 (bnTenTen21) |
Bengali |
trial |
470,732,738 |
BIBLE Polish-Swahili |
Polish |
main |
138,216 |
BIBLE Swahili-Polish |
Swahili |
main |
139,160 |
Boot Camp English |
English |
trial |
85,683,246 |
Bosnian Web (bsWaC 1.2) |
Bosnian |
trial |
248,478,730 |
Brexit corpus (English) |
English |
trial |
108,452,923 |
Brexit corpus without retweets (English) |
English |
trial |
4,789,571 |
British Academic Spoken English Corpus (BASE) |
English |
open |
1,477,281 |
British Academic Written English Corpus (BAWE) |
English |
open |
6,968,089 |
British Law Report Corpus |
English |
main |
8,515,749 |
British National Corpus (BNC) |
English |
trial |
96,132,981 |
British National Corpus (BNC), tagged by CLAWS |
English |
trial |
96,052,598 |
British National Corpus 2014 (BNC2014, spoken part) |
English |
trial |
10,495,185 |
British parliamentary debates (ParlaMint 2.1, CoNLL format) |
English |
trial |
100,967,492 |
British Web 2007 (ukWaC) |
English |
main |
1,313,058,436 |
Brown Corpus |
English |
open |
1,007,299 |
Brown Family |
English |
main |
6,963,778 |
Brown Family (CLAWS + TreeTagger tags) |
English |
main |
6,975,474 |
Bulgarian National Corpus (BulgarianNC) |
Bulgarian |
main |
20,975,703 |
Bulgarian National Corpus nonweb genres |
Bulgarian |
main |
22,398,507 |
Bulgarian National Corpus with web |
Bulgarian |
main |
419,512,059 |
Bulgarian parliamentary debates (ParlaMint 2.1) |
Bulgarian |
trial |
19,099,991 |
Bulgarian parliamentary debates (ParlaMint 2.1, CoNLL format) |
Bulgarian |
trial |
19,096,761 |
Bulgarian Web 2012 (bgTenTen12) |
Bulgarian |
main |
705,156,683 |
Bulgarian Web 2021 (bgTenTen21) |
Bulgarian |
trial |
4,695,125,771 |
Burmese Web 2021 (myTenTen21) |
Burmese |
trial |
557,329,406 |
Cambridge Academic English |
English |
main |
3,163,648 |
Cantonese Web (CantoneseWaC) |
Cantonese |
trial |
30,898,663 |
Catalan Web 2014 (caTenTen14) |
Catalan |
trial |
182,608,420 |
Cebuano Web 2018 (cebTenTen18) |
Cebuano |
trial |
4,552,105 |
CELEN: Learner Corpus of Spanish in Japan |
Spanish |
open |
658,467 |
CHILDES Afrikaans Corpus |
Afrikaans |
main |
26,020 |
CHILDES Catalan Corpus |
Catalan |
main |
209,525 |
CHILDES Croatian Corpus |
Croatian |
main |
300,832 |
CHILDES Danish Corpus |
Danish |
main |
285,231 |
CHILDES English Corpus |
English |
main |
22,693,506 |
CHILDES Estonian Corpus |
Estonian |
main |
313,457 |
CHILDES Farsi Corpus |
Persian |
main |
120,527 |
CHILDES French Corpus |
French |
main |
2,583,460 |
CHILDES Gaelic Corpus |
Irish |
main |
16,848 |
CHILDES German Corpus |
German |
main |
5,941,266 |
CHILDES Hebrew Corpus |
Hebrew |
main |
807,657 |
CHILDES Hungarian Corpus |
Hungarian |
main |
247,881 |
CHILDES Italian Corpus |
Italian |
main |
459,881 |
CHILDES Japanese Corpus |
Japanese |
main |
1,578,068 |
CHILDES Korean Corpus |
Korean |
main |
36,056 |
CHILDES Norwegian Corpus |
Norwegian |
main |
56,827 |
CHILDES Polish Corpus |
Polish |
main |
1,041,300 |
CHILDES Portuguese Corpus |
Portuguese |
main |
216,407 |
CHILDES Russian Corpus |
Russian |
main |
48,791 |
CHILDES Spanish Corpus |
Spanish |
main |
802,743 |
CHILDES Swedish Corpus |
Swedish |
main |
520,478 |
CHILDES Tamil Corpus |
Tamil |
main |
15,490 |
CHILDES Thai Corpus |
Thai |
main |
243,939 |
CHILDES Turkish Corpus |
Turkish |
main |
178,100 |
Chinese GigaWord 2 Corpus: Mainland, simplified |
Chinese Simplified |
main |
205,031,379 |
Chinese GigaWord 2 Corpus: Taiwan, traditional |
Chinese Traditional |
main |
382,600,557 |
Chinese Traditional Web (TaiwanWaC, Universal Sketch Grammar) |
Chinese Traditional |
main |
259,156,002 |
Chinese Traditional Web 2011 (TaiwanWaC) |
Chinese Traditional |
main |
259,156,002 |
Chinese Trends (2023–today) |
Chinese Simplified |
trial |
23,741,157 |
Chinese Web 2005 (Internet-ZH, NEUCSP tagger) |
Chinese Simplified |
main |
198,205,344 |
Chinese Web 2011 (zhTenTen11, sample 10M) |
Chinese Simplified |
main |
9,012,125 |
Chinese Web 2011 (zhTenTen11, Stanford tagger) |
Chinese Simplified |
main |
1,729,867,455 |
Chinese Web 2017 (zhTenTen17) Simplified |
Chinese Simplified |
trial |
13,531,331,169 |
Chinese Web 2017 (zhTenTen17) Traditional |
Chinese Traditional |
trial |
2,400,405,372 |
COLEM |
Spanish |
open |
1,677,597 |
COMPAS 2015 |
English |
ondemand |
114,967,191 |
COMPAS 2016 |
English |
ondemand |
260,896,404 |
CoPEP - The Corpus of Portuguese from Academic Journals (v. 1.4) |
Portuguese |
main |
40,423,011 |
Corpus Nko ߒߞߏ ߝߊ߬ߘߌ߬ߞߋ߬ߟߋ߲߬ߡߊ |
N'Ko |
open |
4,102,593 |
Corpus of Academic Journal Articles (CAJA) |
English |
ondemand |
78,970,299 |
Corpus of English Dialogues 1560–1760 |
English |
ondemand |
1,151,171 |
Corpus of Estonian Web sentences 2020 |
Estonian |
main |
280,961,465 |
Corpus of Estonian Web sentences 2021 |
Estonian |
main |
473,455,876 |
Corpus of the MagyarOK teaching materials for Hungarian, levels A1 to B2 |
Hungarian |
open |
259,200 |
COVID-19 Open Research Dataset (CORD-19) |
English |
open |
1,443,530,655 |
Crimean Tatar National Monolingual & Parallel Corpora, Crimean Tatar |
Crimean Tatar |
open |
2,958,868 |
Crimean Tatar National Monolingual & Parallel Corpora, English |
English |
open |
92,947 |
Crimean Tatar National Monolingual & Parallel Corpora, Russian |
Russian |
open |
538,135 |
Crimean Tatar National Monolingual & Parallel Corpora, Ukrainian |
Ukrainian |
open |
344,454 |
Croatian parliamentary debates (ParlaMint 2.1) |
Croatian |
trial |
20,337,753 |
Croatian parliamentary debates (ParlaMint 2.1, CoNLL format) |
Croatian |
trial |
20,342,230 |
Croatian Web (hrWaC 2.2, ReLDI) |
Croatian |
trial |
1,210,021,198 |
Croatian Web (hrWaC 2.2, RFTagger) |
Croatian |
trial |
1,211,328,660 |
csSkELL v1 (whole documents) |
Czech |
main |
1,717,516,129 |
csSkELL v2.2 (sentences with GDEX scores) |
Czech |
main |
1,443,410,941 |
Cundeelee Wangka Stories (Cundeelee Wangka) |
Cundeelee Wangka |
ondemand |
1,965 |
Cundeelee Wangka Stories (English) |
English |
ondemand |
4,423 |
Czech Drama Corpus |
Czech |
main |
135,105 |
Czech news and web 1995–2002 (czes2.2) |
Czech |
main |
366,796,757 |
Czech parliamentary debates (ParlaMint 2.1) |
Czech |
trial |
22,087,036 |
Czech parliamentary debates (ParlaMint 2.1, CoNLL format) |
Czech |
trial |
22,104,199 |
Czech Trends (2014–today) |
Czech |
trial |
1,985,749,265 |
Czech Web (csTenTen 12+17+19) |
Czech |
trial |
11,722,066,502 |
Czech Web 2012 (csTenTen12 v9a) |
Czech |
main |
4,175,089,441 |
Czech Web 2019 (csTenTen19) |
Czech |
main |
6,280,217,621 |
Czech Web 2023 (csTenTen23) |
Czech |
trial |
4,456,427,977 |
CzechParl 2012 (v2 with lempos) |
Czech |
main |
37,184,025 |
Danish Gigaword (DAGW) |
Danish |
main |
964,617,784 |
Danish parliamentary debates (ParlaMint 2.1) |
Danish |
trial |
29,225,255 |
Danish parliamentary debates (ParlaMint 2.1, CoNLL format) |
Danish |
trial |
29,205,018 |
Danish Trends |
Danish |
trial |
91,371,408 |
Danish Web 2010 (DanishWaC) |
Danish |
main |
288,272,967 |
Danish Web 2014 (daTenTen14) |
Danish |
main |
2,040,976,501 |
Danish Web 2017 (daTenTen17) |
Danish |
main |
1,956,590,663 |
Danish Web 2020 (daTenTen20) |
Danish |
trial |
3,480,275,804 |
DGT-Translation Memory parallel – Bulgarian |
Bulgarian |
main |
25,912,721 |
DGT-Translation Memory parallel – Croatian |
Croatian |
main |
3,968,608 |
DGT-Translation Memory parallel – Czech |
Czech |
main |
43,621,933 |
DGT-Translation Memory parallel – Danish |
Danish |
main |
44,962,280 |
DGT-Translation Memory parallel – Dutch |
Dutch |
main |
50,523,892 |
DGT-Translation Memory parallel – English |
English |
main |
59,106,576 |
DGT-Translation Memory parallel – Estonian |
Estonian |
main |
34,155,488 |
DGT-Translation Memory parallel – Finnish |
Finnish |
main |
35,129,923 |
DGT-Translation Memory parallel – French |
French |
main |
58,224,781 |
DGT-Translation Memory parallel – German |
German |
main |
45,380,666 |
DGT-Translation Memory parallel – Greek |
Greek |
main |
51,865,988 |
DGT-Translation Memory parallel – Hungarian |
Hungarian |
main |
2,306,272 |
DGT-Translation Memory parallel – Irish |
Irish |
main |
1,065,421 |
DGT-Translation Memory parallel – Italian |
Italian |
main |
53,260,912 |
DGT-Translation Memory parallel – Latvian |
Latvian |
main |
38,898,134 |
DGT-Translation Memory parallel – Lithuanian |
Lithuanian |
main |
38,675,242 |
DGT-Translation Memory parallel – Maltese |
Maltese |
main |
22,388,562 |
DGT-Translation Memory parallel – Polish |
Polish |
main |
44,149,107 |
DGT-Translation Memory parallel – Portuguese |
Portuguese |
main |
53,950,705 |
DGT-Translation Memory parallel – Romanian |
Romanian |
main |
26,644,734 |
DGT-Translation Memory parallel – Slovak |
Slovak |
main |
43,276,048 |
DGT-Translation Memory parallel – Slovenian |
Slovenian |
main |
42,897,385 |
DGT-Translation Memory parallel – Spanish |
Spanish |
main |
57,311,149 |
DGT-Translation Memory parallel – Swedish |
Swedish |
main |
44,378,725 |
Directory of Open Access Journals (DOAJ) – English |
English |
trial |
2,662,763,697 |
Dutch parliamentary debates (ParlaMint 2.1) |
Dutch |
trial |
51,175,668 |
Dutch parliamentary debates (ParlaMint 2.1, CoNLL format) |
Dutch |
trial |
51,156,406 |
Dutch Trends |
Dutch |
trial |
259,598,664 |
Dutch Web 2014 (nlTenTen14) |
Dutch |
main |
2,253,777,579 |
Dutch Web 2020 (nlTenTen20) |
Dutch |
trial |
5,890,009,964 |
e-flux (International art English) |
English |
main |
5,036,119 |
EcoLexicon English Corpus (EEC) |
English |
open |
23,169,446 |
ELEXIS Bulgarian Web 2021 |
Bulgarian |
main |
1,014,316,771 |
ELEXIS Bulgarian Web 2021 (bgTenTen21) WSD sample |
Bulgarian |
main |
1,992,046 |
ELEXIS Croatian Web 2020 |
Croatian |
main |
1,006,040,496 |
ELEXIS Croatian Web 2020 (hrTenTen20) WSD sample |
Croatian |
main |
1,964,238 |
ELEXIS Czech Web 2019 |
Czech |
main |
949,730,627 |
ELEXIS Czech Web 2019 (csTenTen19) WSD sample |
Czech |
main |
1,970,054 |
ELEXIS Danish Web 2020 |
Danish |
main |
989,769,308 |
ELEXIS Danish Web 2020 (daTenTen20) WSD sample |
Danish |
main |
1,982,549 |
ELEXIS Dutch Web 2020 |
Dutch |
main |
1,024,660,354 |
ELEXIS Dutch Web 2020 (nlTenTen20) WSD sample |
Dutch |
main |
1,982,397 |
ELEXIS English Web 2020 |
English |
main |
1,000,329,442 |
ELEXIS English Web 2020 (enTenTen20, no genres and topics) WSD sample |
English |
main |
1,999,789 |
ELEXIS Estonian Web 2021 |
Estonian |
main |
1,006,940,696 |
ELEXIS Estonian Web 2021 (etTenTen21) WSD sample |
Estonian |
main |
1,995,380 |
ELEXIS Finnish Web 2019 |
Finnish |
main |
1,011,352,644 |
ELEXIS Finnish Web 2019 (fiTenTen19) WSD sample |
Finnish |
main |
1,993,821 |
ELEXIS French Web 2020 |
French |
main |
1,069,392,783 |
ELEXIS French Web 2020 (frTenTen20) WSD sample |
French |
main |
2,099,651 |
ELEXIS German Web 2020 |
German |
main |
1,023,830,342 |
ELEXIS German Web 2020 (deTenTen20) WSD sample |
German |
main |
1,998,166 |
ELEXIS Greek Web 2019 |
Greek |
main |
1,003,265,093 |
ELEXIS Greek Web 2019 (elTenTen19) WSD sample |
Greek |
main |
1,961,351 |
ELEXIS Hebrew Web 2021 |
Hebrew |
main |
1,043,504,840 |
ELEXIS Hebrew Web 2021 (heTenTen21) WSD sample |
Hebrew |
main |
2,017,821 |
ELEXIS Hungarian Web 2020 |
Hungarian |
main |
994,806,145 |
ELEXIS Hungarian Web 2020 (huTenTen20) WSD sample |
Hungarian |
main |
1,989,855 |
ELEXIS Irish Web 2021 |
Irish |
main |
58,130,702 |
ELEXIS Irish Web 2021 (gaTenTen21) WSD sample |
Irish |
main |
1,980,914 |
ELEXIS Italian Web 2020 |
Italian |
main |
1,020,349,212 |
ELEXIS Italian Web 2020 (itTenTen20) WSD sample |
Italian |
main |
1,996,623 |
ELEXIS Latvian Web 2021 |
Latvian |
main |
1,029,262,793 |
ELEXIS Latvian Web 2021 (lvTenTen21) WSD sample |
Latvian |
main |
2,006,576 |
ELEXIS Lithuanian Web 2021 |
Lithuanian |
main |
846,563,251 |
ELEXIS Lithuanian Web 2021 (ltTenTen21) WSD sample |
Lithuanian |
main |
2,004,075 |
ELEXIS Polish Web 2019 |
Polish |
main |
987,945,132 |
ELEXIS Polish Web 2019 (plTenTen19) WSD sample |
Polish |
main |
1,971,906 |
ELEXIS Portuguese Web 2020 |
Portuguese |
main |
1,021,937,614 |
ELEXIS Portuguese Web 2020 (ptTenTen20) WSD sample |
Portuguese |
main |
1,997,515 |
ELEXIS Romanian Web 2021 |
Romanian |
main |
995,033,835 |
ELEXIS Romanian Web 2021 (roTenTen21) WSD sample |
Romanian |
main |
1,968,801 |
ELEXIS Slovak Web 2021 |
Slovak |
main |
1,008,238,227 |
ELEXIS Slovak Web 2021 (skTenTen21) WSD sample |
Slovak |
main |
1,975,380 |
ELEXIS Slovene Web 2020 (slTenTen20) WSD sample |
Slovenian |
main |
1,964,284 |
ELEXIS Slovenian Web 2020 |
Slovenian |
main |
1,007,206,400 |
ELEXIS Spanish Web 2020 |
Spanish |
main |
1,012,502,656 |
ELEXIS Spanish Web 2020 (esTenTen20) WSD sample |
Spanish |
main |
1,988,999 |
ELEXIS Swedish Web 2020 |
Swedish |
main |
1,006,477,461 |
ELEXIS Swedish Web 2020 (svTenTen20) WSD sample |
Swedish |
main |
1,980,144 |
Elsevier OA CC-BY Corpus |
English |
main |
187,615,459 |
English Broadsheet Newspapers 1993–2021 (SiBol) |
English |
main |
858,566,374 |
English Corpus for SkELL 3.10 |
English |
main |
1,038,200,313 |
English Corpus for SkELL 3.8 |
English |
main |
1,041,772,774 |
English Corpus for SkELL 3.9 |
English |
main |
1,041,138,575 |
English Drama Corpus |
English |
main |
18,846,687 |
English Historical Book Collection (EEBO, ECCO, Evans) |
English |
main |
826,296,048 |
English parliamentary debates (ParlaMint 2.1) |
English |
trial |
100,616,051 |
English Preposition Corpus |
English |
trial |
2,136,325 |
English Trends (2014–today) |
English |
trial |
82,670,198,070 |
English Web 2008 (ententen08_tt31) |
English |
trial |
3,083,193,293 |
English Web 2012 (enTenTen12) |
English |
main |
11,191,860,036 |
English Web 2013 (enTenTen13) |
English |
main |
19,685,733,337 |
English Web 2015 (enTenTen15) |
English |
main |
13,190,556,334 |
English Web 2018 (enTenTen18) |
English |
main |
21,926,740,748 |
English Web 2021 (enTenTen21) |
English |
trial |
52,268,286,493 |
English Wikipedia |
English |
main |
1,356,523,079 |
English Wikipedia sample with Error annotations |
English |
trial |
951,824 |
Environment corpus |
English |
main |
61,197,742 |
Estonian Corpus for Learners 2020 (etSkELL) |
Estonian |
main |
280,572,215 |
Estonian coursebook corpus 2018 |
Estonian |
main |
121,114 |
Estonian National Corpus 2021 (Estonian NC 2021) |
Estonian |
main |
2,410,296,919 |
Estonian National Corpus 2021 (Estonian NC 2021, CoNLL format) |
Estonian |
main |
2,410,296,919 |
Estonian National Corpus 2023 (Estonian NC 2023) |
Estonian |
main |
3,080,721,728 |
Estonian Trends |
Estonian |
trial |
203,282,152 |
Estonian Web 2017 (etTenTen17) |
Estonian |
main |
658,558,136 |
Estonian Web 2019 (etTenTen19) |
Estonian |
main |
508,447,009 |
Estonian Web 2021 (etTenTen21) |
Estonian |
trial |
725,832,092 |
EUR-Lex 2/2016 parallel – Bulgarian |
Bulgarian |
trial |
329,071,554 |
EUR-Lex 2/2016 parallel – Croatian |
Croatian |
trial |
109,138,184 |
EUR-Lex 2/2016 parallel – Czech |
Czech |
trial |
350,230,088 |
EUR-Lex 2/2016 parallel – Danish |
Danish |
trial |
519,765,085 |
EUR-Lex 2/2016 parallel – Dutch |
Dutch |
trial |
583,263,688 |
EUR-Lex 2/2016 parallel – English |
English |
trial |
629,722,593 |
EUR-Lex 2/2016 parallel – Estonian |
Estonian |
trial |
291,077,511 |
EUR-Lex 2/2016 parallel – Finnish |
Finnish |
trial |
384,119,975 |
EUR-Lex 2/2016 parallel – French |
French |
trial |
677,063,993 |
EUR-Lex 2/2016 parallel – German |
German |
trial |
528,617,843 |
EUR-Lex 2/2016 parallel – Greek |
Greek |
trial |
579,344,223 |
EUR-Lex 2/2016 parallel – Hungarian |
Hungarian |
trial |
340,618,970 |
EUR-Lex 2/2016 parallel – Irish |
Irish |
trial |
31,439,542 |
EUR-Lex 2/2016 parallel – Italian |
Italian |
trial |
606,070,097 |
EUR-Lex 2/2016 parallel – Latvian |
Latvian |
trial |
324,734,544 |
EUR-Lex 2/2016 parallel – Lithuanian |
Lithuanian |
trial |
323,151,426 |
EUR-Lex 2/2016 parallel – Maltese |
Maltese |
trial |
314,396,006 |
EUR-Lex 2/2016 parallel – Polish |
Polish |
trial |
360,862,149 |
EUR-Lex 2/2016 parallel – Portuguese |
Portuguese |
trial |
595,066,701 |
EUR-Lex 2/2016 parallel – Romanian |
Romanian |
trial |
336,928,068 |
EUR-Lex 2/2016 parallel – Slovak |
Slovak |
trial |
255,531,673 |
EUR-Lex 2/2016 parallel – Slovenian |
Slovenian |
trial |
351,899,258 |
EUR-Lex 2/2016 parallel – Spanish |
Spanish |
trial |
635,187,126 |
EUR-Lex 2/2016 parallel – Swedish |
Swedish |
trial |
478,485,126 |
EUR-Lex judgments 12/2016 parallel – Bulgarian |
Bulgarian |
trial |
17,071,495 |
EUR-Lex judgments 12/2016 parallel – Croatian |
Croatian |
trial |
5,613,468 |
EUR-Lex judgments 12/2016 parallel – Czech |
Czech |
trial |
18,226,505 |
EUR-Lex judgments 12/2016 parallel – Danish |
Danish |
trial |
34,934,021 |
EUR-Lex judgments 12/2016 parallel – Dutch |
Dutch |
trial |
40,534,071 |
EUR-Lex judgments 12/2016 parallel – English |
English |
trial |
42,339,337 |
EUR-Lex judgments 12/2016 parallel – Estonian |
Estonian |
trial |
15,029,608 |
EUR-Lex judgments 12/2016 parallel – Finnish |
Finnish |
trial |
23,601,422 |
EUR-Lex judgments 12/2016 parallel – French |
French |
trial |
48,023,524 |
EUR-Lex judgments 12/2016 parallel – German |
German |
trial |
35,297,517 |
EUR-Lex judgments 12/2016 parallel – Greek |
Greek |
trial |
35,815,108 |
EUR-Lex judgments 12/2016 parallel – Hungarian |
Hungarian |
trial |
17,940,879 |
EUR-Lex judgments 12/2016 parallel – Italian |
Italian |
trial |
42,053,315 |
EUR-Lex judgments 12/2016 parallel – Latvian |
Latvian |
trial |
16,908,831 |
EUR-Lex judgments 12/2016 parallel – Lithuanian |
Lithuanian |
trial |
16,252,111 |
EUR-Lex judgments 12/2016 parallel – Maltese |
Maltese |
trial |
19,146,797 |
EUR-Lex judgments 12/2016 parallel – Polish |
Polish |
trial |
18,799,551 |
EUR-Lex judgments 12/2016 parallel – Portuguese |
Portuguese |
trial |
35,412,936 |
EUR-Lex judgments 12/2016 parallel – Romanian |
Romanian |
trial |
17,592,388 |
EUR-Lex judgments 12/2016 parallel – Slovak |
Slovak |
trial |
18,265,664 |
EUR-Lex judgments 12/2016 parallel – Slovenian |
Slovenian |
trial |
18,439,766 |
EUR-Lex judgments 12/2016 parallel – Spanish |
Spanish |
trial |
39,431,836 |
EUR-Lex judgments 12/2016 parallel – Swedish |
Swedish |
trial |
30,666,764 |
Europarl spoken parallel – Bulgarian |
Bulgarian |
trial |
9,215,233 |
Europarl spoken parallel – Czech |
Czech |
trial |
13,013,774 |
Europarl spoken parallel – Danish |
Danish |
trial |
48,343,860 |
Europarl spoken parallel – Dutch |
Dutch |
trial |
54,007,722 |
Europarl spoken parallel – English |
English |
trial |
53,837,625 |
Europarl spoken parallel – English |
English |
open |
15,099,625 |
Europarl spoken parallel – Estonian |
Estonian |
trial |
11,171,727 |
Europarl spoken parallel – Finnish |
Finnish |
trial |
34,182,031 |
Europarl spoken parallel – French |
French |
trial |
59,145,988 |
Europarl spoken parallel – French |
French |
open |
16,815,290 |
Europarl spoken parallel – German |
German |
trial |
47,805,055 |
Europarl spoken parallel – Greek |
Greek |
trial |
38,868,863 |
Europarl spoken parallel – Hungarian |
Hungarian |
trial |
12,421,715 |
Europarl spoken parallel – Italian |
Italian |
trial |
52,871,060 |
Europarl spoken parallel – Latvian |
Latvian |
trial |
11,920,085 |
Europarl spoken parallel – Lithuanian |
Lithuanian |
trial |
11,424,032 |
Europarl spoken parallel – Polish |
Polish |
trial |
13,034,164 |
Europarl spoken parallel – Polish |
Polish |
open |
13,034,164 |
Europarl spoken parallel – Portuguese |
Portuguese |
trial |
53,778,766 |
Europarl spoken parallel – Romanian |
Romanian |
trial |
9,554,864 |
Europarl spoken parallel – Slovak |
Slovak |
trial |
12,942,651 |
Europarl spoken parallel – Slovenian |
Slovenian |
trial |
12,496,942 |
Europarl spoken parallel – Spanish |
Spanish |
trial |
54,302,284 |
Europarl spoken parallel – Spanish |
Spanish |
open |
15,513,307 |
Europarl spoken parallel – Swedish |
Swedish |
trial |
46,303,799 |
European Spanish Web 2011 (eseuTenTen11) |
Spanish |
main |
2,021,633,644 |
Film Corpus |
English |
main |
21,661,806 |
Finnish Web 2014 (fiTenTen14) |
Finnish |
trial |
1,404,083,812 |
Finnish Web 2014 (fiTenTen14, TreeTagger v2) |
Finnish |
main |
1,404,100,049 |
Frantext (French literature of the 18th-20th century) |
French |
main |
15,573,070 |
Frantext (French literature of the 18th-20th century), without trends |
French |
main |
15,573,070 |
French corpus of 88,000 SMS (88milSMS) |
French |
trial |
1,206,663 |
French Drama Corpus |
French |
main |
12,822,260 |
French parliamentary debates (ParlaMint 2.1) |
French |
trial |
32,214,147 |
French parliamentary debates (ParlaMint 2.1, CoNLL format) |
French |
trial |
32,176,380 |
French Trends |
French |
trial |
750,441,194 |
French Web 2008 (v2 with lempos) |
French |
main |
104,705,211 |
French Web 2010 (frWaC) |
French |
main |
1,330,564,200 |
French Web 2012 (frTenTen12) |
French |
main |
9,889,689,889 |
French Web 2017 (frTenTen17) |
French |
main |
5,752,261,039 |
French Web 2020 (frTenTen20) |
French |
main |
15,115,914,647 |
French Web 2023 (frTenTen23) |
French |
trial |
23,874,070,858 |
Georgian Web 2013 (kaWaC) |
Georgian |
trial |
50,713,604 |
German Corpus for SkELL 1.0 |
German |
main |
769,810,745 |
German Drama Corpus |
German |
main |
9,374,314 |
German Political Speeches Corpus |
German |
trial |
11,144,258 |
German Trends |
German |
trial |
1,347,685,397 |
German Web 2010 |
German |
main |
2,338,036,362 |
German Web 2010 (deWaC) |
German |
main |
1,348,188,416 |
German Web 2013 (deTenTen13) |
German |
main |
16,526,335,416 |
German Web 2018 (deTenTen18) |
German |
main |
5,346,041,196 |
German Web 2020 (deTenTen20) |
German |
trial |
17,512,733,172 |
GerManC (German Newspapers 1650-1800) |
German |
main |
667,310 |
Gigafida v2.0 (referenčni) |
Slovenian |
main |
1,109,441,592 |
Greek Drama Corpus |
Greek |
main |
269,334 |
Greek Web (GkWaC with lempos) |
Greek |
main |
124,285,612 |
Greek Web 2014 (elTenTen14) |
Greek |
main |
1,671,692,845 |
Greek Web 2019 (elTenTen19) |
Greek |
trial |
2,342,091,029 |
Guangwai - Lancaster Chinese Learner Corpus |
Chinese Simplified |
open |
1,289,060 |
Gujarati Web (guWaC) |
Gujarati |
main |
17,960,095 |
Gujarati Web 2021 (guTenTen21) |
Gujarati |
trial |
88,574,710 |
Gutenberg Afrikaans 2020 |
Afrikaans |
main |
315,010 |
Gutenberg Bulgarian 2020 |
Bulgarian |
main |
33,352 |
Gutenberg Catalan 2020 |
Catalan |
main |
1,320,242 |
Gutenberg Chinese Traditional 2020 |
Chinese Traditional |
main |
27,136,782 |
Gutenberg Czech 2020 |
Czech |
main |
364,683 |
Gutenberg Danish 2020 |
Danish |
main |
3,959,344 |
Gutenberg Dutch 2020 |
Dutch |
main |
87,390,658 |
Gutenberg English 2020 |
English |
main |
2,903,177,585 |
Gutenberg Esperanto 2020 |
Esperanto |
trial |
2,024,013 |
Gutenberg Finnish 2020 |
Finnish |
main |
68,174,366 |
Gutenberg French 2020 |
French |
main |
197,560,500 |
Gutenberg German 2020 |
German |
main |
74,709,930 |
Gutenberg Greek 2020 |
Greek |
main |
7,837,742 |
Gutenberg Hebrew 2020 |
Hebrew |
main |
158,212 |
Gutenberg Hungarian 2020 |
Hungarian |
main |
9,140,833 |
Gutenberg Icelandic 2020 |
Icelandic |
main |
82,211 |
Gutenberg Italian 2020 |
Italian |
main |
93,049,296 |
Gutenberg Japanese 2020 |
Japanese |
main |
963,368 |
Gutenberg Latin 2020 |
Latin |
main |
3,871,335 |
Gutenberg Norwegian Bokmål 2020 |
Norwegian Bokmål |
main |
762,295 |
Gutenberg Polish 2020 |
Polish |
main |
421,318 |
Gutenberg Portuguese 2020 |
Portuguese |
main |
14,309,476 |
Gutenberg Russian 2020 |
Russian |
main |
13,643 |
Gutenberg Serbian 2020 |
Serbian |
main |
70,724 |
Gutenberg Spanish 2020 |
Spanish |
main |
37,202,233 |
Gutenberg Swedish 2020 |
Swedish |
main |
7,919,783 |
Gutenberg Tagalog 2020 |
Tagalog |
main |
2,468,064 |
Gutenberg Telugu 2020 |
Telugu |
main |
157,077 |
Gutenberg Welsh 2020 |
Welsh |
main |
221,733 |
Hausa Web 2015 (hausaWaC15) |
Hausa (Boko) |
trial |
5,304,300 |
Hebrew Drama Corpus |
Hebrew |
main |
954,359 |
Hebrew General Corpus (web crawled, mostly newspapers) |
Hebrew |
main |
157,947,728 |
Hebrew Translation Corpus |
Hebrew |
trial |
1,180,003 |
Hebrew Web (HebWaC) |
Hebrew |
main |
47,832,254 |
Hebrew Web 2014 (heTenTen14, Meni/Alon tagged + lempos) |
Hebrew |
ondemand |
895,876,116 |
Hebrew Web 2014 (heTenTen14, no POS tagging) |
Hebrew |
main |
890,282,843 |
Hebrew Web 2021 (heTenTen21) |
Hebrew |
trial |
2,775,686,699 |
Hindi Web 2012 (HindiWaC v. 4) |
Hindi |
trial |
107,960,109 |
Hindi Web 2013 (hiTenTen13) |
Hindi |
main |
351,289,441 |
Hindi Web 2017 (hiTenTen17) |
Hindi |
main |
1,228,379,747 |
Hindi Web 2021 (hiTenTen21) |
Hindi |
trial |
792,395,313 |
Hungarian Drama Corpus |
Hungarian |
main |
533,088 |
Hungarian parliamentary debates (ParlaMint 2.1) |
Hungarian |
trial |
858,844 |
Hungarian parliamentary debates (ParlaMint 2.1, CoNLL format) |
Hungarian |
trial |
856,543 |
Hungarian Web 2012 (huTenTen12) |
Hungarian |
main |
2,572,620,694 |
Hungarian Web 2020 (huTenTen20) |
Hungarian |
main |
5,164,717,029 |
Hungarian Web 2023 (huTenTen23) |
Hungarian |
trial |
3,494,350,960 |
Icelandic Gigaword Corpus 2017 |
Icelandic |
main |
532,028,866 |
Icelandic parliamentary debates (ParlaMint 2.1) |
Icelandic |
trial |
23,468,157 |
Icelandic parliamentary debates (ParlaMint 2.1, CoNLL format) |
Icelandic |
trial |
23,461,109 |
Icelandic texts [sample] |
Icelandic |
trial |
5,436,035 |
Icelandic Web 2020 (isTenTen20) |
Icelandic |
trial |
518,620,759 |
Igbo Web 2015 (IgboWaC15) |
Igbo |
main |
331,042 |
Igbo Web 2017 (igTenTen17) |
Igbo |
trial |
629,294 |
Indonesian Web (IndonesianWaC) |
Indonesian |
trial |
90,120,046 |
Indonesian Web 2020 (idTenTen20) |
Indonesian |
main |
3,687,192,045 |
Indonesian Web 2024 (idTenTen24) |
Indonesian |
trial |
7,108,841,939 |
Irish Syllabic Poetry, circa 1200-1650 (BARDIC@TCD) |
Irish |
open |
478,445 |
Irish Trends |
Irish |
trial |
1,622,039 |
Irish Web 2022 (gaTenTen22) |
Irish |
trial |
125,040,541 |
Italian Corpus for SkELL 1.0 |
Italian |
main |
328,270,600 |
Italian Drama Corpus |
Italian |
main |
1,669,717 |
Italian parliamentary debates (ParlaMint 2.1) |
Italian |
trial |
26,549,927 |
Italian parliamentary debates (ParlaMint 2.1, CoNLL format) |
Italian |
trial |
26,571,966 |
Italian Trends (2014–today) |
Italian |
trial |
8,899,799,129 |
Italian Web 2006 (itWaC) |
Italian |
main |
1,597,295,469 |
Italian Web 2010 (itTenTen) |
Italian |
main |
2,588,873,046 |
Italian Web 2016 (itTenTen16) |
Italian |
main |
4,989,729,171 |
Italian Web 2020 (itTenTen20) |
Italian |
trial |
12,451,734,885 |
itWaC (reduced) |
Italian |
main |
751,542,948 |
Japanese Web 2006 (jpWaC) |
Japanese |
main |
336,867,039 |
Japanese Web 2011 (jaTenTen11) |
Japanese |
trial |
8,432,294,787 |
Japanese Web 2011 (jaTenTen11, sample) |
Japanese |
main |
301,407,652 |
Japanese Web 2011 sample (jaTenTen11, LUW) |
Japanese |
trial |
163,837,764 |
Kannada Web 2012 (knWaC12) |
Kannada |
trial |
11,056,526 |
KAS-Dipl (diplome) |
Slovenian |
main |
568,188,810 |
KAS-Dr (doktorati) |
Slovenian |
main |
30,244,519 |
KAS-Mag (magisteriji) |
Slovenian |
main |
157,168,378 |
Khmer Web 2018 (kmTenTen18) |
Khmer |
main |
16,500,379 |
Khmer Web 2021 (kmTenTen21) |
Khmer |
trial |
103,066,083 |
Korean Web 2012 (koTenTen12) |
Korean |
main |
461,196,240 |
Korean Web 2018 (koTenTen18) |
Korean |
trial |
1,668,851,720 |
KSUCCA (Classical Arabic) |
Arabic |
trial |
46,705,577 |
Lao Web 2018 (loTenTen18) |
Lao |
main |
15,862,991 |
Lao Web 2019 (loTenTen19) |
Lao |
trial |
105,018,584 |
LatinISE corpus |
Latin |
trial |
11,202,216 |
Latvian parliamentary debates (ParlaMint 2.1) |
Latvian |
trial |
6,318,701 |
Latvian parliamentary debates (ParlaMint 2.1, CoNLL format) |
Latvian |
trial |
6,342,984 |
Latvian Web (LatvianWaC) |
Latvian |
main |
57,666,024 |
Latvian Web 2014 (lvTenTen14) |
Latvian |
trial |
530,367,474 |
Lektor (Learner corpus of proofreading and translations) |
Slovenian |
main |
953,038 |
LEXMCI |
English |
main |
1,448,180,339 |
Lithuanian parliamentary debates (ParlaMint 2.1) |
Lithuanian |
trial |
14,573,624 |
Lithuanian parliamentary debates (ParlaMint 2.1, CoNLL format) |
Lithuanian |
trial |
14,428,682 |
Lithuanian Web (LithuanianWaC v2) |
Lithuanian |
main |
48,650,918 |
Lithuanian Web 2014 (ltTenTen14) |
Lithuanian |
trial |
778,151,979 |
London English Corpus |
English |
main |
2,391,040 |
MaCoCu Albanian Web v1 (2022) |
Albanian |
main |
617,643,884 |
MaCoCu Bosnian Web v1 (2021-2022) |
Bosnian |
trial |
715,708,157 |
MaCoCu Croatian Web v2 (2021–2022) |
Croatian |
main |
2,299,750,788 |
MaCoCu Macedonian Web v2 (2021) |
Macedonian |
trial |
512,171,886 |
MaCoCu Maltese Web v2 (2021) |
Maltese |
main |
331,665,362 |
MaCoCu Montenegrin Web v1 (2021-2022) |
Montenegrin |
main |
157,680,373 |
MaCoCu Serbian Web v1 (2021-2022) |
Serbian |
main |
2,435,143,021 |
MaCoCu Slovene Web v2 (2021-2022) |
Slovenian |
main |
1,863,942,989 |
MaCoCu Turkish Web v2 (2021) |
Turkish |
trial |
4,261,087,826 |
MaCoCu Ukrainian Web v1 (2021-2022) |
Ukrainian |
main |
5,912,040,719 |
Magpie corpus |
English |
main |
4,597,782 |
Malay Web 2020 (msTenTen20) |
Malay |
trial |
296,419,465 |
Malayalam Web (malayalamWaC) |
Malayalam |
trial |
15,950,663 |
Malaysian Web (MalaysianWaC) |
Malay |
trial |
182,578,743 |
Maldivian Wikipedia corpus 2019 (dvwiki) |
Maldivian |
trial |
548,211 |
Maltese MLRS Corpus |
Maltese |
trial |
110,714,844 |
Maltese Trends |
Maltese |
trial |
6,218,457 |
Maori Web 2013 and 2020 (miTenTen20) |
Maori |
trial |
11,814,825 |
Medical Web Corpus |
English |
main |
33,961,786 |
Merlin Written Learner Czech |
Czech |
main |
75,526 |
Merlin Written Learner German |
German |
main |
150,256 |
Merlin Written Learner Italian |
Italian |
main |
107,797 |
METCLIL: Metaphor in EMI seminars |
English |
open |
110,493 |
Mongolian Web Texts 2016 (mnWaC16) |
Mongolian |
trial |
6,104,565 |
Mueller Report |
English |
trial |
167,103 |
Nepali National Corpus |
Nepali |
trial |
13,440,835 |
Nepali Web (NepaliWaC) |
Nepali |
main |
1,290,388 |
New corpus for English (NCI English) |
English |
main |
216,618,095 |
New Model Corpus |
English |
main |
95,276,958 |
Newspapers in Portuguese (CetemPúblico, CetenFolha) |
Portuguese |
main |
56,768,822 |
Norwegian dictionary corpus (Nynorskkorpuset) |
Norwegian |
main |
74,496,664 |
Norwegian Web 2012 |
Norwegian |
main |
669,511,569 |
Norwegian Web 2017 (noTenTen17, Bokmål and Nynorsk) |
Norwegian |
trial |
2,630,849,803 |
Norwegian Web 2017 (noTenTen17, Bokmål) |
Norwegian Bokmål |
trial |
2,461,704,417 |
Norwegian Web 2017 (noTenTen17, Nynorsk) |
Norwegian Nynorsk |
trial |
169,145,386 |
OEC |
English |
ondemand |
2,073,319,589 |
Old French and Middle French (BFM 2022) |
French |
main |
6,002,552 |
Open American National Corpus (spoken) |
English |
main |
3,202,026 |
Open American National Corpus (written) |
English |
main |
11,048,137 |
Open Cambridge Learner Corpus (Uncoded) |
English |
ondemand |
2,975,701 |
Open Parallel Corpus (OPUS) – Afrikaans |
Afrikaans |
main |
586,334 |
Open Parallel Corpus (OPUS) – Albanian |
Albanian |
main |
46,304,346 |
Open Parallel Corpus (OPUS) – Arabic |
Arabic |
main |
300,000,057 |
Open Parallel Corpus (OPUS) – Bosnian |
Bosnian |
main |
43,582,516 |
Open Parallel Corpus (OPUS) – Bulgarian |
Bulgarian |
main |
183,115,244 |
Open Parallel Corpus (OPUS) – Croatian |
Croatian |
main |
121,369,625 |
Open Parallel Corpus (OPUS) – Czech |
Czech |
main |
203,845,619 |
Open Parallel Corpus (OPUS) – Danish |
Danish |
main |
120,107,271 |
Open Parallel Corpus (OPUS) – Dutch |
Dutch |
main |
356,363,571 |
Open Parallel Corpus (OPUS) – English |
English |
main |
1,139,515,048 |
Open Parallel Corpus (OPUS) – Estonian |
Estonian |
main |
64,879,741 |
Open Parallel Corpus (OPUS) – Finnish |
Finnish |
main |
131,985,872 |
Open Parallel Corpus (OPUS) – French |
French |
main |
766,833,908 |
Open Parallel Corpus (OPUS) – German |
German |
main |
125,229,773 |
Open Parallel Corpus (OPUS) – Greek |
Greek |
main |
239,360,926 |
Open Parallel Corpus (OPUS) – Hebrew |
Hebrew |
main |
130,972,343 |
Open Parallel Corpus (OPUS) – Hindi |
Hindi |
main |
854,741 |
Open Parallel Corpus (OPUS) – Hungarian |
Hungarian |
main |
157,495,018 |
Open Parallel Corpus (OPUS) – Italian |
Italian |
main |
180,532,849 |
Open Parallel Corpus (OPUS) – Japanese |
Japanese |
main |
5,455,106 |
Open Parallel Corpus (OPUS) – Korean |
Korean |
main |
374,850 |
Open Parallel Corpus (OPUS) – Latvian |
Latvian |
main |
24,499,516 |
Open Parallel Corpus (OPUS) – Lithuanian |
Lithuanian |
main |
29,621,940 |
Open Parallel Corpus (OPUS) – Macedonian |
Macedonian |
main |
40,348,792 |
Open Parallel Corpus (OPUS) – Persian |
Persian |
main |
4,425,133 |
Open Parallel Corpus (OPUS) – Polish |
Polish |
main |
208,008,636 |
Open Parallel Corpus (OPUS) – Portuguese |
Portuguese |
main |
297,700,205 |
Open Parallel Corpus (OPUS) – Portuguese |
Portuguese |
main |
272,300,927 |
Open Parallel Corpus (OPUS) – Romanian |
Romanian |
main |
282,408,295 |
Open Parallel Corpus (OPUS) – Russian |
Russian |
main |
307,709,872 |
Open Parallel Corpus (OPUS) – Serbian |
Serbian |
main |
153,237,786 |
Open Parallel Corpus (OPUS) – Slovak |
Slovak |
main |
62,451,407 |
Open Parallel Corpus (OPUS) – Slovenian |
Slovenian |
main |
121,228,966 |
Open Parallel Corpus (OPUS) – Spanish |
Spanish |
main |
701,944,027 |
Open Parallel Corpus (OPUS) – Swedish |
Swedish |
main |
102,298,686 |
Open Parallel Corpus (OPUS) – Turkish |
Turkish |
main |
151,342,424 |
Open Parallel Corpus (OPUS) – Ukrainian |
Ukrainian |
main |
2,577,481 |
Open Parallel Corpus OPUS – Chinese Simplified |
Chinese Simplified |
main |
243,427,123 |
Open Parallel Corpus OPUS – Chinese Traditional |
Chinese Traditional |
main |
380,245 |
Open Parallel Corpus OPUS – Norwegian Bokmål |
Norwegian |
main |
20,237,510 |
OpenSubtitles 2018 parallel – Afrikaans |
Afrikaans |
main |
341,349 |
OpenSubtitles 2018 parallel – Albanian |
Albanian |
main |
15,662,170 |
OpenSubtitles 2018 parallel – Arabic |
Arabic |
main |
333,329,378 |
OpenSubtitles 2018 parallel – Armenian |
Armenian |
main |
24,216 |
OpenSubtitles 2018 parallel – Basque |
Basque |
main |
3,919,829 |
OpenSubtitles 2018 parallel – Bengali |
Bengali |
main |
2,270,841 |
OpenSubtitles 2018 parallel – Bosnian |
Bosnian |
main |
125,323,299 |
OpenSubtitles 2018 parallel – Breton |
Breton |
trial |
85,503 |
OpenSubtitles 2018 parallel – Bulgarian |
Bulgarian |
main |
371,685,493 |
OpenSubtitles 2018 parallel – Catalan |
Catalan |
main |
3,273,561 |
OpenSubtitles 2018 parallel – Chinese Simplified |
Chinese Simplified |
main |
119,998,854 |
OpenSubtitles 2018 parallel – Chinese Traditional |
Chinese Traditional |
main |
41,876,166 |
OpenSubtitles 2018 parallel – Croatian |
Croatian |
main |
370,177,938 |
OpenSubtitles 2018 parallel – Czech |
Czech |
main |
453,218,524 |
OpenSubtitles 2018 parallel – Danish |
Danish |
main |
135,228,416 |
OpenSubtitles 2018 parallel – Dutch |
Dutch |
main |
444,413,064 |
OpenSubtitles 2018 parallel – English |
English |
main |
1,211,666,401 |
OpenSubtitles 2018 parallel – Esperanto |
Esperanto |
main |
396,790 |
OpenSubtitles 2018 parallel – Estonian |
Estonian |
main |
107,391,459 |
OpenSubtitles 2018 parallel – Finnish |
Finnish |
main |
175,247,181 |
OpenSubtitles 2018 parallel – French |
French |
main |
462,749,061 |
OpenSubtitles 2018 parallel – Galician |
Galician |
trial |
1,572,312 |
OpenSubtitles 2018 parallel – Georgian |
Georgian |
main |
1,157,136 |
OpenSubtitles 2018 parallel – German |
German |
main |
185,133,927 |
OpenSubtitles 2018 parallel – Greek |
Greek |
main |
457,347,003 |
OpenSubtitles 2018 parallel – Hebrew |
Hebrew |
main |
371,473,205 |
OpenSubtitles 2018 parallel – Hindi |
Hindi |
main |
675,322 |
OpenSubtitles 2018 parallel – Hungarian |
Hungarian |
main |
378,525,740 |
OpenSubtitles 2018 parallel – Icelandic |
Icelandic |
main |
9,194,074 |
OpenSubtitles 2018 parallel – Indonesian |
Indonesian |
main |
77,273,767 |
OpenSubtitles 2018 parallel – Italian |
Italian |
main |
431,415,848 |
OpenSubtitles 2018 parallel – Japanese |
Japanese |
main |
15,224,480 |
OpenSubtitles 2018 parallel – Kazakh |
Kazakh |
main |
14,172 |
OpenSubtitles 2018 parallel – Korean |
Korean |
main |
7,432,927 |
OpenSubtitles 2018 parallel – Latvian |
Latvian |
main |
2,494,901 |
OpenSubtitles 2018 parallel – Lithuanian |
Lithuanian |
main |
6,806,857 |
OpenSubtitles 2018 parallel – Macedonian |
Macedonian |
main |
28,859,153 |
OpenSubtitles 2018 parallel – Malay |
Malay |
main |
13,465,077 |
OpenSubtitles 2018 parallel – Malayalam |
Malayalam |
main |
1,671,708 |
OpenSubtitles 2018 parallel – Norwegian (Mixed) |
Norwegian |
main |
61,215,172 |
OpenSubtitles 2018 parallel – Persian |
Persian |
main |
53,444,595 |
OpenSubtitles 2018 parallel – Polish |
Polish |
main |
496,167,686 |
OpenSubtitles 2018 parallel – Portuguese |
Portuguese |
main |
545,598,189 |
OpenSubtitles 2018 parallel – Portuguese |
Portuguese |
main |
466,021,603 |
OpenSubtitles 2018 parallel – Romanian |
Romanian |
main |
658,289,867 |
OpenSubtitles 2018 parallel – Russian |
Russian |
main |
180,032,832 |
OpenSubtitles 2018 parallel – Serbian |
Serbian |
main |
480,367,760 |
OpenSubtitles 2018 parallel – Sinhalese |
Sinhalese |
trial |
3,430,727 |
OpenSubtitles 2018 parallel – Slovak |
Slovak |
main |
66,455,056 |
OpenSubtitles 2018 parallel – Slovenian |
Slovenian |
main |
198,366,873 |
OpenSubtitles 2018 parallel – Spanish |
Spanish |
main |
753,235,853 |
OpenSubtitles 2018 parallel – Swedish |
Swedish |
main |
153,717,474 |
OpenSubtitles 2018 parallel – Tagalog |
Tagalog |
main |
96,291 |
OpenSubtitles 2018 parallel – Tamil |
Tamil |
main |
132,055 |
OpenSubtitles 2018 parallel – Telugu |
Telugu |
main |
109,730 |
OpenSubtitles 2018 parallel – Thai |
Thai |
main |
33,223,171 |
OpenSubtitles 2018 parallel – Turkish |
Turkish |
main |
461,809,489 |
OpenSubtitles 2018 parallel – Ukrainian |
Ukrainian |
main |
5,049,602 |
OpenSubtitles 2018 parallel – Urdu |
Urdu |
main |
229,947 |
OpenSubtitles 2018 parallel – Vietnamese |
Vietnamese |
main |
31,848,385 |
OPUS MontenegrinSubs parallel – English |
English |
trial |
468,337 |
OPUS MontenegrinSubs parallel – Montenegrin |
Montenegrin |
trial |
365,698 |
Oromo Web 2016 (orWaC16) |
Oromo |
trial |
4,249,953 |
Oxford Children's Corpus 2015 (PTag) |
English |
ondemand |
210,322,185 |
Oxford Children's Corpus 2015 – Education (PTag) |
English |
ondemand |
1,323,174 |
Oxford Children's Corpus 2015 – Reading (PTag) |
English |
ondemand |
34,284,687 |
Oxford Children's Corpus 2015 – Writing (PTag) |
English |
ondemand |
174,714,324 |
Oxford Children's Corpus 2016 (PTag) |
English |
ondemand |
284,360,063 |
Oxford Children's Corpus 2016 – Reading (PTag) |
English |
ondemand |
53,858,955 |
Oxford Children's Corpus 2016 – Writing (PTag) |
English |
ondemand |
229,177,934 |
Oxford Corpus of Academic English (OCAE, April 2012) |
English |
ondemand |
71,371,739 |
Paisa |
Italian |
main |
221,989,288 |
ParlaTalk Austria parliamentary debates (lower house) |
German |
trial |
7,675,413 |
ParlaTalk Austria parliamentary debates (upper house) |
German |
trial |
3,101,421 |
ParlaTalk Belgium parliamentary debates (lower house) |
French |
trial |
58,073,338 |
ParlaTalk Bulgaria parliamentary debates |
Bulgarian |
trial |
15,221,455 |
ParlaTalk Czech Republic parliamentary debates (lower house) |
Czech |
trial |
22,091,557 |
ParlaTalk Czech Republic parliamentary debates (upper house) |
Czech |
trial |
11,737,338 |
ParlaTalk Denmark parliamentary debates |
Danish |
trial |
80,017,714 |
ParlaTalk Estonia parliamentary debates |
Estonian |
trial |
11,665,859 |
ParlaTalk Finland parliamentary debates |
Finnish |
trial |
22,660,060 |
ParlaTalk France parliamentary debates (lower house) |
French |
trial |
61,116,819 |
ParlaTalk France parliamentary debates (upper house) |
French |
trial |
181,508,579 |
ParlaTalk German parliamentary debates (lower house) |
German |
trial |
130,988,058 |
ParlaTalk Greek parliamentary debates |
Greek |
trial |
23,540,099 |
ParlaTalk Hungary parliamentary debates |
Hungarian |
trial |
3,077,151 |
ParlaTalk Ireland parliamentary debates |
English |
trial |
121,302,091 |
ParlaTalk Italy parliamentary debates (lower house) |
Italian |
trial |
7,656,348 |
ParlaTalk Italy parliamentary debates (upper house) |
Italian |
trial |
13,308,453 |
ParlaTalk Netherlands parliamentary debates (lower house) |
Dutch |
trial |
82,035,039 |
ParlaTalk Netherlands parliamentary debates (upper house) |
Dutch |
trial |
12,073,192 |
ParlaTalk Poland parliamentary debates (upper house) |
Polish |
trial |
20,409,110 |
ParlaTalk Portugal parliamentary debates |
Portuguese |
trial |
141,098,975 |
ParlaTalk Romania parliamentary debates (lower house) |
Romanian |
trial |
15,772,145 |
ParlaTalk Romania parliamentary debates (upper house) |
Romanian |
trial |
27,543,309 |
ParlaTalk Slovakia parliamentary debates |
Slovak |
trial |
9,790,175 |
ParlaTalk Slovenia parliamentary debates (lower house) |
Slovenian |
trial |
26,002,443 |
ParlaTalk Spain parliamentary debates (lower house) |
Spanish |
trial |
1,882,700 |
ParlaTalk Sweden parliamentary debates |
Swedish |
trial |
131,739,759 |
Parsed German Web (sDeWaC) |
German |
main |
755,165,551 |
Penn Corpora of Historical English |
English |
ondemand |
3,800,639 |
Persian Trends |
Persian |
trial |
281,903,621 |
PICAE 2010 |
English |
ondemand |
31,025,920 |
Polish Drama Corpus |
Polish |
main |
117,230 |
Polish language of the 1960s |
Polish |
main |
546,042 |
Polish Parliamentary Corpus (PPC) |
Polish |
main |
553,858,723 |
Polish parliamentary debates (ParlaMint 2.1) |
Polish |
trial |
26,619,472 |
Polish parliamentary debates (ParlaMint 2.1, CoNLL format) |
Polish |
trial |
26,882,964 |
Polish Trends |
Polish |
trial |
625,775,055 |
Polish Web (PolishWac, Morfeusz and TaKIPI tagger) |
Polish |
main |
103,028,410 |
Polish Web 2012 (plTenTen12, RFTagger) |
Polish |
main |
7,715,835,214 |
Polish Web 2012 sample (plTenTen12) |
Polish |
main |
45,208,497 |
Polish Web 2019 (plTenTen19) |
Polish |
trial |
4,253,636,443 |
Polish Web 2019 term reference (plTenTen19_01) |
Polish |
trial |
181,036,098 |
Portuguese Trends |
Portuguese |
trial |
681,684,349 |
Portuguese Web 2011 (ptTenTen11) |
Portuguese |
main |
3,896,392,719 |
Portuguese Web 2011 (ptTenTen11, Palavras parsed) |
Portuguese |
main |
2,757,635,105 |
Portuguese Web 2018 (ptTenTen18) |
Portuguese |
trial |
7,407,393,731 |
Portuguese Web 2023 (ptTenTen23) |
Portuguese |
trial |
16,976,742,883 |
Project Gutenberg English |
English |
main |
443,471,071 |
pukWaC (ukWaC parsed with MaltParser) |
English |
main |
39,496,785 |
Quran annotated corpus [unvowelled Arabic] |
Arabic |
main |
128,243 |
Quran annotated corpus [unvowelled Latin] |
Arabic |
main |
99,268 |
Quran annotated corpus [vowelled Arabic] |
Arabic |
main |
128,241 |
Quran annotated corpus [vowelled Latin] |
Arabic |
main |
97,970 |
RapCor1360 - Francophone rap songs |
French |
trial |
735,513 |
Riznica v0.1 |
Croatian |
main |
85,273,724 |
Roman Drama Corpus |
Latin |
main |
278,890 |
Romanian Web 2016 (roTenTen16) |
Romanian |
main |
2,640,496,763 |
Romanian Web 2021 (roTenTen21) |
Romanian |
trial |
2,763,173,824 |
ruSkELL 1.6 |
Russian |
main |
975,584,449 |
Russian Drama Corpus |
Russian |
main |
2,011,699 |
Russian Sites in Estonian Web 2017–2023 |
Russian |
main |
312,244,562 |
Russian Trends |
Russian |
trial |
1,436,217,893 |
Russian Web 2006 (v2 with lempos) |
Russian |
main |
147,930,261 |
Russian Web 2011 (ruTenTen11) |
Russian |
trial |
14,553,856,113 |
Russian Web 2017 (ruTenTen17) |
Russian |
trial |
9,034,837,939 |
Samoan Web (SamoanWac1) |
Samoan |
trial |
3,115,385 |
Santa Barbara Corpus of Spoken American English |
English |
main |
249,655 |
ScienceBlogs |
English |
main |
103,175,233 |
Scottish Gaelic Wiki 2015 (gdWiki) |
Scottish Gaelic |
trial |
980,026 |
Semcor v3.0 (sense-tagged corpus) |
English |
main |
664,038 |
Serbian Web (srWaC 1.2 processed by Hunpos) |
Serbian |
trial |
477,724,164 |
Serbian Web (srWaC 1.2 processed by RFTagger v1) |
Serbian (Latin) |
trial |
441,888,202 |
Serbian Web (srWaC 1.2) |
Serbian (Latin) |
trial |
476,888,297 |
Setswana/Tswana Web (SetswanaWaC v2) |
Setswana |
trial |
11,496,687 |
Shakespeare English Drama Corpus |
English |
main |
810,929 |
Shakespeare German Drama Corpus |
German |
main |
796,439 |
Slovak Trends |
Slovak |
trial |
179,768,140 |
Slovak Web 2011 (skTenTen11) |
Slovak |
main |
540,112,634 |
Slovak Web 2011 (skTenTen11, ambiguity tag, lempos) |
Slovak |
main |
715,707,053 |
Slovak Web 2023 (skTenTen23) |
Slovak |
trial |
898,031,101 |
Slovene Trends |
Slovenian |
trial |
106,489,718 |
Slovenian parliamentary debates (ParlaMint 2.1) |
Slovenian |
trial |
19,933,512 |
Slovenian parliamentary debates (ParlaMint 2.1, CoNLL format) |
Slovenian |
trial |
19,933,836 |
Slovenian reference corpus (FidaPLUS v2) |
Slovenian |
trial |
600,309,637 |
Slovenian Web (slWaC 2.1) |
Slovenian |
trial |
754,255,589 |
Slovenian Web (slWaC 2.1, processed with TreeTagger version 2) |
Slovenian |
trial |
755,255,547 |
Slovenian Web 2015 (slTenTen15, TreeTagger v2) |
Slovenian |
trial |
829,544,337 |
Somali Web 2016 (soWaC16) |
Somali |
trial |
71,871,585 |
SoNaR |
Dutch |
ondemand |
425,978,755 |
Sorani Kurdish Wikipedia corpus 2020 (ckbwiki20) |
Kurdish (Sorani) |
trial |
5,042,449 |
Spanish Calderon Drama Corpus |
Spanish |
main |
2,112,643 |
Spanish Drama Corpus |
Spanish |
main |
371,624 |
Spanish parliamentary debates (ParlaMint 2.1) |
Spanish |
trial |
12,875,498 |
Spanish parliamentary debates (ParlaMint 2.1, CoNLL format) |
Spanish |
trial |
12,930,870 |
Spanish Trends |
Spanish |
trial |
1,320,334,777 |
Spanish Web 2005 (SpanishWaC) |
Spanish |
main |
97,773,185 |
Spanish Web 2011 (esTenTen11, Eu + Am) |
Spanish |
main |
9,497,213,009 |
Spanish Web 2018 (esTenTen18) |
Spanish |
trial |
16,953,735,742 |
Susanne |
English |
trial |
128,998 |
Swahili Web 2014 (swWaC) |
Swahili |
trial |
17,882,483 |
Swedish Drama Corpus |
Swedish |
main |
581,524 |
Swedish Parole |
Swedish |
main |
21,735,113 |
Swedish Web 2014 (svTenTen14) |
Swedish |
trial |
3,401,035,817 |
Tagalog (Filipino) Web 2019 (tlTenTen19) |
Tagalog |
trial |
198,303,250 |
Tajik Web (TajikWaC) |
Tajik |
trial |
93,151,897 |
TalkBank Persian (blog posts) |
Persian |
trial |
269,753,238 |
Tamil Web 2015 (TamilWaC) |
Tamil |
main |
26,750,515 |
Tamil Web 2021 (taTenTen21) |
Tamil |
trial |
823,837,031 |
Tatar Drama Corpus |
Turkish |
main |
10,595 |
Tatar Mixed Corpus |
Tatar |
trial |
102,779,803 |
Tatar News (2000–2014) |
Tatar |
main |
24,927,439 |
Tatar Web 2015 sample |
Tatar |
trial |
195,901 |
Telugu Web 2017 (teTenTen) |
Telugu |
trial |
126,807,158 |
Terms of Service (English) |
English |
open |
168,199 |
Thai Web (ThaiWaC) |
Thai |
trial |
82,787,119 |
Thai Web 2018 (thTenTen18) |
Thai |
trial |
640,530,227 |
The Annotated Corpus of Classical Tibetan (ACTib 2.0) |
Tibetan |
trial |
170,202,078 |
The Digital Corpus of Sanskrit (2010 – 2019) |
Sanskrit (romanised) |
trial |
3,361,394 |
The Digital Parisian Stage Corpus |
French |
main |
172,202 |
The New Corpus for Ireland |
Irish |
main |
29,886,201 |
Tigrinya Web 2016 (tiWaC16) |
Tigrinya |
trial |
2,087,613 |
Timestamped JSI web corpus 2014-2016 Catalan |
Catalan |
trial |
99,395,494 |
Timestamped JSI web corpus 2014-2016 Finnish |
Finnish |
trial |
119,109,490 |
Timestamped JSI web corpus 2014-2016 French |
French |
trial |
1,870,341,756 |
Timestamped JSI web corpus 2014-2016 German |
German |
trial |
1,987,759,563 |
Timestamped JSI web corpus 2014-2016 Hebrew |
Hebrew |
trial |
111,339,363 |
Timestamped JSI web corpus 2014-2016 Hungarian |
Hungarian |
trial |
180,843,359 |
Timestamped JSI web corpus 2014-2016 Korean |
Korean |
trial |
438,816,127 |
Timestamped JSI web corpus 2014-2016 Polish |
Polish |
trial |
157,930,228 |
Timestamped JSI web corpus 2014-2016 Portuguese |
Portuguese |
trial |
1,109,771,393 |
Timestamped JSI web corpus 2014-2016 Russian |
Russian |
trial |
1,120,731,416 |
Timestamped JSI web corpus 2014-2016 Serbian |
Serbian |
trial |
86,380,673 |
Timestamped JSI web corpus 2014-2016 Spanish |
Spanish |
trial |
4,055,944,612 |
Timestamped JSI web corpus 2014-2016 Swedish |
Swedish |
trial |
335,782,681 |
Timestamped JSI web corpus 2014-2021 Catalan |
Catalan |
main |
449,634,119 |
Timestamped JSI web corpus 2014-2021 Finnish |
Finnish |
main |
421,879,841 |
Timestamped JSI web corpus 2014-2021 French |
French |
main |
6,998,186,326 |
Timestamped JSI web corpus 2014-2021 German |
German |
main |
7,055,641,455 |
Timestamped JSI web corpus 2014-2021 Hebrew |
Hebrew |
main |
466,851,576 |
Timestamped JSI web corpus 2014-2021 Hungarian |
Hungarian |
main |
903,862,798 |
Timestamped JSI web corpus 2014-2021 Korean |
Korean |
main |
1,576,995,357 |
Timestamped JSI web corpus 2014-2021 Polish |
Polish |
main |
973,863,152 |
Timestamped JSI web corpus 2014-2021 Portuguese |
Portuguese |
main |
4,685,199,909 |
Timestamped JSI web corpus 2014-2021 Russian |
Russian |
main |
5,788,590,952 |
Timestamped JSI web corpus 2014-2021 Serbian |
Serbian |
main |
565,311,513 |
Timestamped JSI web corpus 2014-2021 Spanish |
Spanish |
main |
16,358,148,966 |
Timestamped JSI web corpus 2014-2021 Swedish |
Swedish |
main |
1,162,692,802 |
Timestamped JSI web corpus 2014-2022 Estonian |
Estonian |
main |
270,502,859 |
Timestamped JSI web corpus 2021-03 Catalan |
Catalan |
main |
12,107,597 |
Timestamped JSI web corpus 2021-03 Czech |
Czech |
main |
20,431,801 |
Timestamped JSI web corpus 2021-03 Finnish |
Finnish |
main |
6,154,402 |
Timestamped JSI web corpus 2021-03 French |
French |
main |
145,384,862 |
Timestamped JSI web corpus 2021-03 German |
German |
main |
126,775,824 |
Timestamped JSI web corpus 2021-03 Hebrew |
Hebrew |
main |
8,450,710 |
Timestamped JSI web corpus 2021-03 Hungarian |
Hungarian |
main |
30,439,114 |
Timestamped JSI web corpus 2021-03 Italian |
Italian |
main |
365,307,999 |
Timestamped JSI web corpus 2021-03 Korean |
Korean |
main |
19,324,576 |
Timestamped JSI web corpus 2021-03 Polish |
Polish |
main |
38,911,481 |
Timestamped JSI web corpus 2021-03 Portuguese |
Portuguese |
main |
108,540,406 |
Timestamped JSI web corpus 2021-03 Russian |
Russian |
main |
150,971,438 |
Timestamped JSI web corpus 2021-03 Serbian |
Serbian |
main |
15,122,285 |
Timestamped JSI web corpus 2021-03 Spanish |
Spanish |
main |
373,185,400 |
Timestamped JSI web corpus 2021-03 Swedish |
Swedish |
main |
22,715,935 |
Timestamped JSI web corpus 2021-04 Catalan |
Catalan |
main |
8,926,986 |
Timestamped JSI web corpus 2021-04 Czech |
Czech |
main |
15,095,366 |
Timestamped JSI web corpus 2021-04 Finnish |
Finnish |
main |
5,624,514 |
Timestamped JSI web corpus 2021-04 French |
French |
main |
113,581,013 |
Timestamped JSI web corpus 2021-04 German |
German |
main |
89,579,085 |
Timestamped JSI web corpus 2021-04 Hebrew |
Hebrew |
main |
6,544,178 |
Timestamped JSI web corpus 2021-04 Hungarian |
Hungarian |
main |
23,392,828 |
Timestamped JSI web corpus 2021-04 Italian |
Italian |
main |
261,813,779 |
Timestamped JSI web corpus 2021-04 Korean |
Korean |
main |
15,506,235 |
Timestamped JSI web corpus 2021-04 Polish |
Polish |
main |
28,676,001 |
Timestamped JSI web corpus 2021-04 Portuguese |
Portuguese |
main |
85,486,841 |
Timestamped JSI web corpus 2021-04 Russian |
Russian |
main |
117,645,204 |
Timestamped JSI web corpus 2021-04 Serbian |
Serbian |
main |
12,237,307 |
Timestamped JSI web corpus 2021-04 Spanish |
Spanish |
main |
289,923,417 |
Timestamped JSI web corpus 2021-04 Swedish |
Swedish |
main |
16,876,787 |
Timestamped JSI web corpus 2021-2022 Ukrainian |
Ukrainian |
main |
199,135,032 |
Timestamped JSI web corpus 2021-22 Spanish |
Spanish |
main |
5,869,620,451 |
Toxicity Corpus |
English |
main |
102,132,547 |
Transhistorical Corpus of Written English (TCWE) |
English |
open |
501,633 |
Turkic web – Azerbaijani |
Azerbaijani |
trial |
94,267,206 |
Turkic web – Kazakh |
Kazakh |
trial |
139,417,763 |
Turkic web – Kyrgyz |
Kyrgyz |
trial |
19,369,507 |
Turkic web – Turkmen |
Turkmen |
trial |
2,105,359 |
Turkic web – Uzbek |
Uzbek |
trial |
18,720,334 |
Turkish parliamentary debates (ParlaMint 2.1) |
Turkish |
trial |
40,873,301 |
Turkish parliamentary debates (ParlaMint 2.1, CoNLL format) |
Turkish |
trial |
42,913,306 |
Turkish Web (trWaC) |
Turkish |
main |
32,791,491 |
Turkish Web 2012 (trTenTen12) |
Turkish |
main |
3,388,418,900 |
Turkish Web 2020 (trTenTen20) |
Turkish |
trial |
4,980,168,485 |
Ukrainian Drama Corpus |
Ukrainian |
main |
322,441 |
Ukrainian Trends |
Ukrainian |
trial |
668,811,511 |
Ukrainian Web 2014 (ukTenTen14) |
Ukrainian |
main |
2,194,447,594 |
Ukrainian Web 2020 and 2014 (ukTenTen20) |
Ukrainian |
main |
2,592,516,436 |
Ukrainian Web 2022 (ukTenTen22) |
Ukrainian |
trial |
7,594,784,148 |
UKWaC super sensed |
English |
main |
315,402,632 |
United Nations Parallel Corpus (UNPC) – Arabic |
Arabic |
trial |
545,594,235 |
United Nations Parallel Corpus (UNPC) – Chinese |
Chinese Simplified |
trial |
372,004,482 |
United Nations Parallel Corpus (UNPC) – English |
English |
trial |
664,924,245 |
United Nations Parallel Corpus (UNPC) – French |
French |
trial |
800,980,141 |
United Nations Parallel Corpus (UNPC) – Russian |
Russian |
trial |
529,667,487 |
United Nations Parallel Corpus (UNPC) – Spanish |
Spanish |
trial |
692,809,915 |
Urdu Web (UrduWaC) |
Urdu |
main |
53,269,273 |
Urdu Web 2018 (urTenTen18) |
Urdu |
trial |
245,656,128 |
Vietnamese Web (viWaC) |
Vietnamese |
trial |
106,664,817 |
Vietnamese Web 2017 (viTenTen17) |
Vietnamese |
trial |
6,056,899,600 |
Welsh Web 2013 (WelshWaC) |
Welsh |
trial |
12,458,397 |
Welsh web corpus |
Welsh |
main |
50,392,441 |
Western Frisian Web 2013 (FrisianWaC) |
Frisian |
trial |
3,116,119 |
Western Punjabi Web 2017 in Shahmukhi script (pnbTenTen17) |
Punjabi (Gurmukhi) |
trial |
2,806,904 |
Yiddish Drama Corpus |
Yiddish |
main |
51,351 |
Yiddish Wikipedia corpus 2018 (yiwiki) |
Yiddish |
trial |
2,106,912 |
Yoruba Web 2015 (YorubaWaC15) |
Yoruba |
trial |
2,816,965 |