[DEV] Estonian RSS Feed Corpus (Filosoft v2) | Estonian | main | 75,647,225 |
[DEV] Timestamped JSI web corpus 2014-2020 Estonian | Estonian | main | 212,608,965 |
ACL Anthology Reference Corpus (ARC) | English | open | 62,196,334 |
Afrikaans Wikipedia corpus 2018 (afwiki) | Afrikaans | trial | 14,466,792 |
American Spanish Web 2011 (esamTenTen11) | Spanish | trial | 7,475,579,365 |
Amharic Web 2013-17 (amWaC17) | Amharic | trial | 25,975,846 |
ArabCC – Learner Corpus of English Essays | English | main | 202,364 |
Arabic Learner Corpus (ALC) | Arabic | main | 362,712 |
Arabic Web 2009 | Arabic | main | 150,282,522 |
Arabic Web 2012 (arTenTen12, Stanford tagger) | Arabic | trial | 7,475,624,779 |
Arabic Web 2012 sample 115M (arTenTen12, Mada tagger) | Arabic | main | 115,315,274 |
Araneum Anglicum Africanum Maius [2015] | English | main | 854,484,093 |
Araneum Anglicum Asiaticum Maius [2015] | English | main | 867,259,037 |
Araneum Anglicum Maius [2015] | English | trial | 888,466,066 |
Araneum Finnicum Maius [2014] | Finnish | main | 817,453,523 |
Araneum Francogallicum Maius [2015] | French | main | 933,688,995 |
Araneum Germanicum Maius [2013] | German | main | 875,465,845 |
Araneum Hispanicum Maius [2013] | Spanish | main | 892,299,770 |
Araneum Hungaricum Maius [2014] | Hungarian | trial | 792,549,686 |
Araneum Italicum Maius (Italian, 14.12) 1,20 G | Italian | main | 890,568,531 |
Araneum Nederlandicum Maius [2013] | Dutch | main | 713,417,518 |
Araneum Polonicum Maius [2013] | Polish | main | 595,768,667 |
Araneum Portugallicum Maius [2015] | Portuguese | main | 862,134,902 |
Araneum Russicum Russicum Maius (Russia-only Russian, 15.03) 1,20 G | Russian | trial | 859,319,823 |
Araneum Slovacum Maius [2013] | Slovak | trial | 816,125,010 |
Basque Web (BasqueWaC v2) | Basque | trial | 99,719,584 |
Belarusian Web 2016 (beTenTen16) | Belarusian | trial | 63,327,264 |
Bengali Web (bnWaC) | Bengali | trial | 11,519,730 |
BIBLE Polish-Swahili | Polish | main | 138,216 |
BIBLE Swahili-Polish | Swahili | main | 139,160 |
Boot Camp English | English | trial | 85,683,246 |
Bosnian Web (bsWaC 1.2) | Bosnian | trial | 248,478,730 |
Brazilian Portuguese corpus (Corpus Brasileiro) | Portuguese | main | 871,117,178 |
Brexit corpus (English) | English | trial | 108,452,923 |
Brexit corpus without retweets (English) | English | trial | 4,789,571 |
British Academic Spoken English Corpus (BASE) | English | open | 1,477,281 |
British Academic Written English Corpus (BAWE) | English | open | 6,968,089 |
British Law Report Corpus | English | main | 8,515,749 |
British National Corpus (BNC) | English | trial | 96,134,547 |
British National Corpus (BNC) 2014 Spoken | English | trial | 10,495,185 |
British National Corpus (BNC), tagged by CLAWS | English | trial | 96,052,598 |
British Web 2007 (ukWaC) | English | main | 1,313,058,436 |
Brown | English | open | 1,007,299 |
Brown Family | English | main | 6,963,778 |
Brown Family, CLAWS + TreeTagger tags | English | main | 6,975,474 |
Bulgarian National Corpus (BulgarianNC) | Bulgarian | main | 20,975,703 |
Bulgarian National Corpus nonweb genres | Bulgarian | main | 22,398,507 |
Bulgarian National Corpus with web | Bulgarian | main | 419,512,059 |
Bulgarian Web 2012 (bgTenTen12, TreeTagger v2) | Bulgarian | trial | 705,156,683 |
Burmese Web 2021 (myTenTen21) | Burmese | trial | 557,329,406 |
Cambridge Academic English | English | main | 3,163,648 |
Cantonese Web (CantoneseWaC) | Cantonese | trial | 30,898,663 |
Catalan Web 2014 (caTenTen14) | Catalan | trial | 182,608,420 |
Cebuano Web 2018 (cebTenTen18) | Cebuano | trial | 4,552,105 |
CELEN: Learner Corpus of Spanish in Japan | Spanish | open | 389,414 |
CHILDES Afrikaans Corpus | Afrikaans | main | 26,020 |
CHILDES Catalan Corpus | Catalan | main | 209,525 |
CHILDES Croatian Corpus | Croatian | main | 300,832 |
CHILDES Danish Corpus | Danish | main | 285,231 |
CHILDES English Corpus | English | main | 22,693,506 |
CHILDES Estonian Corpus | Estonian | main | 313,457 |
CHILDES Farsi Corpus | Persian | main | 120,527 |
CHILDES French Corpus | French | main | 2,583,460 |
CHILDES Gaelic Corpus | Irish | main | 16,848 |
CHILDES German Corpus | German | main | 5,941,266 |
CHILDES Hebrew Corpus | Hebrew | main | 807,657 |
CHILDES Hungarian Corpus | Hungarian | main | 247,881 |
CHILDES Italian Corpus | Italian | main | 459,881 |
CHILDES Japanese Corpus | Japanese | main | 1,578,068 |
CHILDES Korean Corpus | Korean | main | 36,056 |
CHILDES Norwegian Corpus | Norwegian | main | 56,827 |
CHILDES Polish Corpus | Polish | main | 1,041,300 |
CHILDES Portuguese Corpus | Portuguese | main | 216,407 |
CHILDES Russian Corpus | Russian | main | 48,791 |
CHILDES Spanish Corpus | Spanish | main | 802,743 |
CHILDES Swedish Corpus | Swedish | main | 520,478 |
CHILDES Tamil Corpus | Tamil | main | 15,490 |
CHILDES Thai Corpus | Thai | main | 243,939 |
CHILDES Turkish Corpus | Turkish | main | 178,100 |
Chinese GigaWord 2 Corpus: Mainland, simplified | Chinese Simplified | main | 205,031,379 |
Chinese GigaWord 2 Corpus: Taiwan, traditional | Chinese Traditional | main | 382,600,557 |
Chinese Simplified Web 2017 sample | Chinese Simplified | trial | 250,361,047 |
Chinese Traditional Web (TaiwanWaC, Universal Sketch Grammar) | Chinese Traditional | main | 259,156,002 |
Chinese Traditional Web 2011 (TaiwanWaC) | Chinese Traditional | main | 259,156,002 |
Chinese Traditional Web 2017 (zhTenTen17) sample | Chinese Traditional | trial | 239,882,651 |
Chinese Web 2005 (Internet-ZH, NEUCSP tagger) | Chinese Simplified | main | 198,205,344 |
Chinese Web 2011 (zhTenTen11, sample 10M) | Chinese Simplified | main | 9,012,125 |
Chinese Web 2011 (zhTenTen11, Stanford tagger) | Chinese Simplified | trial | 1,729,867,455 |
Chinese Web 2017 (zhTenTen17) Simplified | Chinese Simplified | trial | 13,531,331,169 |
Chinese Web 2017 (zhTenTen17) Traditional | Chinese Traditional | trial | 2,400,405,372 |
COMPAS 2015 | English | access on demand | 114,967,191 |
COMPAS 2016 | English | access on demand | 260,896,404 |
CoPEP - The Corpus of Portuguese from Academic Journals (v. 1.4) | Portuguese | main | 40,423,011 |
Corpus Nko ߒߞߏ ߝߊ߬ߘߌ߬ߞߋ߬ߟߋ߲߬ߡߊ | N'Ko | open | 4,102,593 |
Corpus of Academic Journal Articles (CAJA) | English | access on demand | 78,970,299 |
Corpus of English Dialogues 1560–1760 | English | access on demand | 1,151,171 |
Corpus of Estonian Web sentences 2020 | Estonian | main | 280,961,465 |
Corpus of Estonian Web sentences 2021 | Estonian | main | 473,455,876 |
Covid-19 | English | open | 224,061,570 |
Croatian Web (hrWaC 2.2, ReLDI) | Croatian | trial | 1,210,021,198 |
Croatian Web (hrWaC 2.2, RFTagger) | Croatian | trial | 1,211,328,660 |
csSkELL v1 (whole documents) | Czech | main | 1,717,516,129 |
csSkELL v2.2 (sentences with GDEX scores) | Czech | main | 1,443,410,941 |
Cundeelee Wangka Stories (Cundeelee Wangka) | Cundeelee Wangka | access on demand | 1,965 |
Cundeelee Wangka Stories (English) | English | access on demand | 4,423 |
Czech news and web 1995–2002 (czes2.2) | Czech | main | 366,796,757 |
Czech Web 2017 (csTenTen17) | Czech | trial | 10,502,222,474 |
Czech Web 2017 sample | Czech | trial | 249,877,322 |
CzechParl 2012 (v2 with lempos) | Czech | main | 37,184,025 |
Danish Gigaword (DAGW) | Danish | trial | 964,617,784 |
Danish Web 2010 (DanishWaC) | Danish | main | 288,272,967 |
Danish Web 2014 (daTenTen14) | Danish | main | 2,040,976,501 |
Danish Web 2017 (daTenTen17) | Danish | main | 1,956,590,663 |
Danish Web 2017 sample | Danish | main | 214,447,970 |
Danish Web 2020 (daTenTen20) | Danish | trial | 3,480,275,804 |
DGT, Bulgarian | Bulgarian | main | 25,912,721 |
DGT, Croatian | Croatian | main | 3,968,608 |
DGT, Czech | Czech | main | 43,621,933 |
DGT, Danish | Danish | main | 44,962,280 |
DGT, Dutch | Dutch | main | 50,523,892 |
DGT, English | English | main | 59,106,576 |
DGT, Estonian | Estonian | main | 34,155,488 |
DGT, Finnish | Finnish | main | 35,129,923 |
DGT, French | French | main | 58,224,781 |
DGT, German | German | main | 45,380,666 |
DGT, Greek | Greek | main | 51,865,988 |
DGT, Hungarian | Hungarian | main | 2,306,272 |
DGT, Irish | Irish | main | 1,065,421 |
DGT, Italian | Italian | main | 53,260,912 |
DGT, Latvian | Latvian | main | 38,898,134 |
DGT, Lithuanian | Lithuanian | main | 38,675,242 |
DGT, Maltese | Maltese | main | 22,388,562 |
DGT, Polish | Polish | main | 44,149,107 |
DGT, Portuguese | Portuguese | main | 53,950,705 |
DGT, Romanian | Romanian | main | 26,644,734 |
DGT, Slovak | Slovak | main | 43,276,048 |
DGT, Slovenian | Slovenian | main | 42,897,385 |
DGT, Spanish | Spanish | main | 57,311,149 |
DGT, Swedish | Swedish | main | 44,378,725 |
Dutch Web 2014 (nlTenTen14) | Dutch | trial | 2,253,777,579 |
Dutch Web 2014 sample | Dutch | trial | 250,219,005 |
e-flux (International art English) | English | main | 5,036,119 |
EcoLexicon English (Environment) | English | open | 23,169,446 |
ELEXIS Bulgarian Web 2021 (bgTenTen21) WSD sample | Bulgarian | main | 1,992,046 |
ELEXIS Croatian Web 2020 | Croatian | main | 1,006,040,496 |
ELEXIS Croatian Web 2020 (hrTenTen20) WSD sample | Croatian | main | 1,964,238 |
ELEXIS Czech Web 2019 | Czech | main | 949,730,627 |
ELEXIS Czech Web 2019 (csTenTen19) WSD sample | Czech | main | 1,970,054 |
ELEXIS Danish Web 2020 | Danish | main | 989,769,308 |
ELEXIS Danish Web 2020 (daTenTen20) WSD sample | Danish | main | 1,982,549 |
ELEXIS Dutch Web 2020 | Dutch | main | 1,024,660,354 |
ELEXIS Dutch Web 2020 (nlTenTen20) WSD sample | Dutch | main | 1,982,397 |
ELEXIS English Web 2020 | English | main | 1,000,329,442 |
ELEXIS English Web 2020 (enTenTen20, no genres and topics) WSD sample | English | main | 1,999,789 |
ELEXIS Estonian Web 2021 | Estonian | main | 1,006,940,696 |
ELEXIS Estonian Web 2021 (etTenTen21) WSD sample | Estonian | main | 1,995,380 |
ELEXIS Finnish Web 2019 (fiTenTen19) WSD sample | Finnish | main | 1,993,821 |
ELEXIS French Web 2020 | French | main | 1,069,392,783 |
ELEXIS French Web 2020 (frTenTen20) WSD sample | French | main | 2,099,651 |
ELEXIS German Web 2020 | German | main | 1,023,830,342 |
ELEXIS German Web 2020 (deTenTen20) WSD sample | German | main | 1,998,166 |
ELEXIS Greek Web 2019 (elTenTen19) WSD sample | Greek | main | 1,961,351 |
ELEXIS Hebrew Web 2021 | Hebrew | main | 1,043,504,840 |
ELEXIS Hebrew Web 2021 (heTenTen21) WSD sample | Hebrew | main | 2,017,821 |
ELEXIS Hungarian Web 2020 | Hungarian | main | 994,806,145 |
ELEXIS Hungarian Web 2020 (huTenTen20) WSD sample | Hungarian | main | 1,989,855 |
ELEXIS Irish Web 2021 | Irish | main | 58,130,702 |
ELEXIS Irish Web 2021 (gaTenTen21) WSD sample | Irish | main | 1,980,914 |
ELEXIS Italian Web 2020 | Italian | main | 1,020,349,212 |
ELEXIS Italian Web 2020 (itTenTen20) WSD sample | Italian | main | 1,996,623 |
ELEXIS Latvian Web 2021 | Latvian | main | 1,029,262,793 |
ELEXIS Latvian Web 2021 (lvTenTen21) WSD sample | Latvian | main | 2,006,576 |
ELEXIS Lithuanian Web 2021 | Lithuanian | main | 846,563,251 |
ELEXIS Lithuanian Web 2021 (ltTenTen21) WSD sample | Lithuanian | main | 2,004,075 |
ELEXIS Polish Web 2019 | Polish | main | 987,945,132 |
ELEXIS Polish Web 2019 (plTenTen19) WSD sample | Polish | main | 1,971,906 |
ELEXIS Portuguese Web 2020 | Portuguese | main | 1,021,937,614 |
ELEXIS Portuguese Web 2020 (ptTenTen20) WSD sample | Portuguese | main | 1,997,515 |
ELEXIS Romanian Web 2021 (roTenTen21) WSD sample | Romanian | main | 1,968,801 |
ELEXIS Slovak Web 2021 | Slovak | main | 1,008,238,227 |
ELEXIS Slovak Web 2021 (skTenTen21) WSD sample | Slovak | main | 1,975,380 |
ELEXIS Slovene Web 2020 (slTenTen20) WSD sample | Slovenian | main | 1,964,284 |
ELEXIS Slovenian Web 2020 | Slovenian | main | 1,007,206,400 |
ELEXIS Spanish Web 2020 | Spanish | main | 1,012,502,656 |
ELEXIS Spanish Web 2020 (esTenTen20) WSD sample | Spanish | main | 1,988,999 |
ELEXIS Swedish Web 2020 | Swedish | main | 1,006,477,461 |
ELEXIS Swedish Web 2020 (svTenTen20) WSD sample | Swedish | main | 1,980,144 |
Elsevier OA CC-BY Corpus | English | main | 187,615,459 |
English Broadsheet Newspapers 1993–2013 (SiBol with trends) | English | main | 654,435,535 |
English Corpus for SkELL 3.10 | English | main | 1,038,200,313 |
English Corpus for SkELL 3.8 | English | main | 1,041,772,774 |
English Corpus for SkELL 3.9 | English | main | 1,041,138,575 |
English Historical Book Collection (EEBO, ECCO, Evans) | English | main | 826,296,048 |
English Preposition Corpus | English | trial | 2,136,325 |
English Trends | English | trial | 975,908,936 |
English Web 2008 (enTenTen08) | English | main | 2,759,340,513 |
English Web 2012 (enTenTen12) | English | main | 11,191,860,036 |
English Web 2013 (enTenTen13) | English | main | 19,685,733,337 |
English Web 2013 sample | English | main | 204,976,089 |
English Web 2015 (enTenTen15) | English | trial | 13,190,556,334 |
English Web 2018 (enTenTen18) | English | trial | 21,926,740,748 |
English Web 2020 (enTenTen20) | English | trial | 36,561,273,153 |
English Wikipedia | English | main | 1,356,523,079 |
English Wikipedia sample with Error annotations | English | trial | 951,824 |
Estonian Corpus for Learners 2020 (etSkELL) | Estonian | main | 280,572,215 |
Estonian coursebook corpus 2018 | Estonian | main | 121,114 |
Estonian National Corpus 2019 (Estonian NC 2019) | Estonian | main | 1,500,284,681 |
Estonian National Corpus 2021 (Estonian NC 2021) | Estonian | main | 2,410,296,919 |
Estonian Web 2017 (etTenTen17) | Estonian | main | 658,558,136 |
Estonian Web 2019 (etTenTen19) | Estonian | main | 508,447,009 |
Estonian Web 2021 (etTenTen21) | Estonian | trial | 725,832,092 |
EUR-Lex Bulgarian 2/2016 | Bulgarian | trial | 329,071,554 |
EUR-Lex Croatian 2/2016 | Croatian | trial | 109,138,184 |
EUR-Lex Czech 2/2016 | Czech | trial | 350,230,088 |
EUR-Lex Danish 2/2016 | Danish | trial | 519,765,085 |
EUR-Lex Dutch 2/2016 | Dutch | trial | 583,263,688 |
EUR-Lex English 2/2016 | English | trial | 629,722,593 |
EUR-Lex Estonian 2/2016 | Estonian | trial | 291,077,511 |
EUR-Lex Finnish 2/2016 | Finnish | trial | 384,119,975 |
EUR-Lex French 2/2016 | French | trial | 677,063,993 |
EUR-Lex German 2/2016 | German | trial | 528,617,843 |
EUR-Lex Greek 2/2016 | Greek | trial | 579,344,223 |
EUR-Lex Hungarian 2/2016 | Hungarian | trial | 340,618,970 |
EUR-Lex Irish 2/2016 | Irish | trial | 31,439,542 |
EUR-Lex Italian 2/2016 | Italian | trial | 606,070,097 |
EUR-Lex judgments Bulgarian 12/2016 | Bulgarian | trial | 17,071,495 |
EUR-Lex judgments Croatian 12/2016 | Croatian | trial | 5,613,468 |
EUR-Lex judgments Czech 12/2016 | Czech | trial | 18,226,505 |
EUR-Lex judgments Danish 12/2016 | Danish | trial | 34,934,021 |
EUR-Lex judgments Dutch 12/2016 | Dutch | trial | 40,534,071 |
EUR-Lex judgments English 12/2016 | English | trial | 42,339,337 |
EUR-Lex judgments Estonian 12/2016 | Estonian | trial | 15,029,608 |
EUR-Lex judgments Finnish 12/2016 | Finnish | trial | 23,601,422 |
EUR-Lex judgments French 12/2016 | French | trial | 48,023,524 |
EUR-Lex judgments German 12/2016 | German | trial | 35,297,517 |
EUR-Lex judgments Greek 12/2016 | Greek | trial | 35,815,108 |
EUR-Lex judgments Hungarian 12/2016 | Hungarian | trial | 17,940,879 |
EUR-Lex judgments Italian 12/2016 | Italian | trial | 42,053,315 |
EUR-Lex judgments Latvian 12/2016 | Latvian | trial | 16,908,831 |
EUR-Lex judgments Lithuanian 12/2016 | Lithuanian | trial | 16,252,111 |
EUR-Lex judgments Maltese 12/2016 | Maltese | trial | 19,146,797 |
EUR-Lex judgments Polish 12/2016 | Polish | trial | 18,799,551 |
EUR-Lex judgments Portuguese 12/2016 | Portuguese | trial | 35,412,936 |
EUR-Lex judgments Romanian 12/2016 | Romanian | trial | 17,592,388 |
EUR-Lex judgments Slovak 12/2016 | Slovak | trial | 18,265,664 |
EUR-Lex judgments Slovenian 12/2016 | Slovenian | trial | 18,439,766 |
EUR-Lex judgments Spanish 12/2016 | Spanish | trial | 39,431,836 |
EUR-Lex judgments Swedish 12/2016 | Swedish | trial | 30,666,764 |
EUR-Lex Latvian 2/2016 | Latvian | trial | 324,734,544 |
EUR-Lex Lithuanian 2/2016 | Lithuanian | trial | 323,151,426 |
EUR-Lex Maltese 2/2016 | Maltese | trial | 314,396,006 |
EUR-Lex Polish 2/2016 | Polish | trial | 360,862,149 |
EUR-Lex Portuguese 2/2016 | Portuguese | trial | 595,066,701 |
EUR-Lex Romanian 2/2016 | Romanian | trial | 336,928,068 |
EUR-Lex Slovak 2/2016 | Slovak | trial | 255,531,673 |
EUR-Lex Slovenian 2/2016 | Slovenian | trial | 351,899,258 |
EUR-Lex Spanish 2/2016 | Spanish | trial | 635,187,126 |
EUR-Lex Swedish 2/2016 | Swedish | trial | 478,485,126 |
EUROPARL7 sample, English | English | open | 15,099,625 |
EUROPARL7 sample, French | French | open | 16,815,290 |
EUROPARL7 sample, Polish | Polish | open | 13,034,164 |
EUROPARL7 sample, Spanish | Spanish | open | 15,513,307 |
EUROPARL7, Bulgarian | Bulgarian | trial | 9,215,233 |
EUROPARL7, Czech | Czech | trial | 13,013,774 |
EUROPARL7, Danish | Danish | trial | 48,343,860 |
EUROPARL7, Dutch | Dutch | trial | 54,007,722 |
EUROPARL7, English | English | trial | 53,837,625 |
EUROPARL7, Estonian | Estonian | trial | 11,171,727 |
EUROPARL7, Finnish | Finnish | trial | 34,182,031 |
EUROPARL7, French | French | trial | 59,145,988 |
EUROPARL7, German | German | trial | 47,805,055 |
EUROPARL7, Greek | Greek | trial | 38,868,863 |
EUROPARL7, Hungarian | Hungarian | trial | 12,421,715 |
EUROPARL7, Italian | Italian | trial | 52,871,060 |
EUROPARL7, Latvian | Latvian | trial | 11,920,085 |
EUROPARL7, Lithuanian | Lithuanian | trial | 11,424,032 |
EUROPARL7, Polish | Polish | trial | 13,034,164 |
EUROPARL7, Portuguese | Portuguese | trial | 53,778,766 |
EUROPARL7, Romanian | Romanian | trial | 9,554,864 |
EUROPARL7, Slovak | Slovak | trial | 12,942,651 |
EUROPARL7, Slovenian | Slovenian | trial | 12,496,942 |
EUROPARL7, Spanish | Spanish | trial | 54,302,284 |
EUROPARL7, Swedish | Swedish | trial | 46,303,799 |
European Spanish Web 2011 (eseuTenTen11) | Spanish | trial | 2,021,633,644 |
Finnish Web 2014 (fiTenTen14) | Finnish | trial | 1,404,083,812 |
Finnish Web 2014 (fiTenTen14, TreeTagger v2) | Finnish | main | 1,404,100,049 |
Finnish Web 2014 sample (fiTenTen14, TreeTagger v2) | Finnish | trial | 40,756,118 |
Frantext (French literature of the 18th-20th century) | French | main | 15,573,070 |
Frantext (French literature of the 18th-20th century), without trends | French | main | 15,573,070 |
French corpus of 88,000 SMS (88milSMS) | French | trial | 1,206,663 |
French Web 2008 (v2 with lempos) | French | main | 104,705,211 |
French Web 2010 (frWaC) | French | main | 1,330,564,200 |
French Web 2012 (frTenTen12) | French | trial | 9,889,689,889 |
French Web 2012 sample | French | trial | 205,185,797 |
French Web 2017 (frTenTen17) | French | trial | 5,752,261,039 |
French Web 2017 sample | French | trial | 404,555,405 |
Georgian Web 2013 (kaWaC) | Georgian | trial | 50,713,604 |
German Corpus for SkELL 1.0 | German | main | 769,810,745 |
German Political Speeches Corpus | German | trial | 11,144,258 |
German Web 2010 | German | main | 2,338,036,362 |
German Web 2010 (deWaC) | German | main | 1,348,188,416 |
German Web 2013 (deTenTen13) | German | main | 16,526,335,416 |
German Web 2013 sample | German | trial | 193,838,751 |
German Web 2018 (deTenTen18) | German | trial | 5,346,041,196 |
German Web 2020 (deTenTen20) | German | trial | 17,512,733,172 |
GerManC (German Newspapers 1650-1800) | German | main | 667,310 |
Gigafida v2.0 (referenčni) | Slovenian | main | 1,109,441,592 |
Greek Web (GkWaC with lempos) | Greek | main | 124,285,612 |
Greek Web 2014 (elTenTen14) | Greek | main | 1,671,692,845 |
Greek Web 2019 (elTenTen19) | Greek | trial | 2,342,091,029 |
Guangwai - Lancaster Chinese Learner Corpus | Chinese Simplified | open | 1,289,060 |
Gujarati Web (guWaC) | Gujarati | trial | 17,960,095 |
Gutenberg Afrikaans 2020 | Afrikaans | main | 157,606 |
Gutenberg Bulgarian 2020 | Bulgarian | main | 33,352 |
Gutenberg Catalan 2020 | Catalan | main | 1,320,242 |
Gutenberg Chinese Traditional 2020 | Chinese Traditional | main | 27,136,782 |
Gutenberg Czech 2020 | Czech | main | 364,683 |
Gutenberg Danish 2020 | Danish | main | 3,959,344 |
Gutenberg Dutch 2020 | Dutch | main | 87,390,658 |
Gutenberg English 2020 | English | main | 2,903,177,585 |
Gutenberg Esperanto 2020 | Esperanto | main | 2,024,013 |
Gutenberg Finnish 2020 | Finnish | main | 68,174,366 |
Gutenberg French 2020 | French | main | 197,560,500 |
Gutenberg German 2020 | German | main | 74,709,930 |
Gutenberg Greek 2020 | Greek | main | 7,837,742 |
Gutenberg Hebrew 2020 | Hebrew | main | 158,212 |
Gutenberg Hungarian 2020 | Hungarian | main | 9,140,833 |
Gutenberg Icelandic 2020 | Icelandic | main | 82,211 |
Gutenberg Italian 2020 | Italian | main | 93,049,296 |
Gutenberg Japanese 2020 | Japanese | main | 963,368 |
Gutenberg Latin 2020 | Latin | main | 3,871,335 |
Gutenberg Norwegian Bokmål 2020 | Norwegian Bokmål | main | 762,295 |
Gutenberg Polish 2020 | Polish | main | 421,318 |
Gutenberg Portuguese 2020 | Portuguese | main | 14,309,476 |
Gutenberg Russian 2020 | Russian | main | 13,643 |
Gutenberg Serbian 2020 | Serbian | main | 70,724 |
Gutenberg Spanish 2020 | Spanish | main | 37,202,233 |
Gutenberg Swedish 2020 | Swedish | main | 7,919,783 |
Gutenberg Tagalog 2020 | Tagalog | main | 2,468,064 |
Gutenberg Telugu 2020 | Telugu | main | 157,077 |
Gutenberg Welsh 2020 | Welsh | main | 221,733 |
Hausa Web 2015 (hausaWaC15) | Hausa (Boko) | trial | 5,304,300 |
Hebrew General Corpus (web crawled, mostly newspapers) | Hebrew | main | 157,947,728 |
Hebrew Translation Corpus | Hebrew | trial | 1,180,003 |
Hebrew Web (HebWaC) | Hebrew | main | 47,832,254 |
Hebrew Web 2014 (heTenTen14, Meni/Alon tagged + lempos) | Hebrew | access on demand | 895,876,116 |
Hebrew Web 2014 (heTenTen14, no POS tagging) | Hebrew | main | 890,282,843 |
Hebrew Web 2021 (heTenTen21) | Hebrew | trial | 2,775,686,699 |
Hindi Web 2012 (HindiWaC v. 4) | Hindi | trial | 107,960,109 |
Hindi Web 2013 (hiTenTen13) | Hindi | main | 351,289,441 |
Hindi Web 2017 (hiTenTen17) | Hindi | main | 1,228,379,747 |
Hindi Web 2021 (hiTenTen21) | Hindi | trial | 792,395,313 |
Hungarian Web 2012 (huTenTen12) | Hungarian | trial | 2,572,620,694 |
Icelandic texts [sample] | Icelandic | trial | 5,436,035 |
Icelandic Web 2020 (isTenTen20) | Icelandic | trial | 518,620,759 |
Igbo Web 2015 (IgboWaC15) | Igbo | main | 331,042 |
Igbo Web 2017 (igTenTen17) | Igbo | trial | 629,294 |
Indonesian Web (IndonesianWaC) | Indonesian | trial | 90,120,046 |
Irish Syllabic Poetry, circa 1200-1650 (BARDIC@TCD) | Irish | open | 314,846 |
Italian Corpus for SkELL 1.0 | Italian | main | 328,270,600 |
Italian Web 2006 (itWaC) | Italian | main | 1,597,295,469 |
Italian Web 2010 (itTenTen) | Italian | main | 2,588,873,046 |
Italian Web 2016 (itTenTen16) | Italian | trial | 4,989,729,171 |
Italian Web 2016 sample | Italian | trial | 201,204,942 |
Italian Web 2020 (itTenTen20) | Italian | trial | 12,451,734,885 |
itWAC (reduced) | Italian | main | 751,542,948 |
Japanese Web 2006 (jpWaC) | Japanese | main | 336,867,039 |
Japanese Web 2011 (jaTenTen11) | Japanese | trial | 8,432,256,578 |
Japanese Web 2011 (jaTenTen11, sample) | Japanese | main | 301,407,652 |
Japanese Web 2011 sample (jaTenTen11, LUW) | Japanese | trial | 163,837,671 |
Kannada Web 2012 (knWaC12) | Kannada | trial | 11,056,526 |
KAS-Dipl (diplome) | Slovenian | main | 568,188,810 |
KAS-Dr (doktorati) | Slovenian | main | 30,244,519 |
KAS-Mag (magisteriji) | Slovenian | main | 157,168,378 |
Khmer Web 2018 (kmTenTen18) | Khmer | trial | 16,500,379 |
Khmer Web 2021 (kmTenTen21) | Khmer | trial | 103,066,083 |
Korean 2018 term reference corpus (koTenTen18_term_ref) | Korean | trial | 83,749,660 |
Korean Web 2012 (koTenTen12) | Korean | main | 461,196,240 |
Korean Web 2018 (koTenTen18) | Korean | trial | 1,668,851,720 |
KSUCCA (Classical Arabic) | Arabic | trial | 46,705,577 |
Lao Web 2018 (loTenTen18) | Lao | trial | 15,862,991 |
Lao Web 2019 (loTenTen19) | Lao | trial | 105,018,584 |
LatinISE corpus | Latin | trial | 11,202,216 |
Latvian Web (LatvianWaC) | Latvian | main | 57,666,024 |
Latvian Web 2014 (lvTenTen14) | Latvian | trial | 530,367,474 |
Lektor (Learner corpus of proofread and translations) | Slovenian | main | 953,038 |
LEXMCI | English | main | 1,448,180,339 |
Lithuanian Web (LithuanianWaC v2) | Lithuanian | main | 48,650,918 |
Lithuanian Web 2014 (ltTenTen14) | Lithuanian | trial | 778,151,979 |
Magpie corpus | English | main | 4,597,782 |
MagyarOK teaching materials for Hungarian, levels A1 to B2 | Hungarian | open | 144,832 |
Malayalam Web (malayalamWaC) | Malayalam | trial | 15,950,663 |
Malaysian Web (MalaysianWaC) | Malay | trial | 182,578,743 |
Maldivian Wikipedia corpus 2019 (dvwiki) | Maldivian | trial | 548,211 |
Maltese MLRS Corpus | Maltese | trial | 110,714,844 |
Maori Web 2013 and 2020 (miTenTen20) | Maori | trial | 11,814,825 |
Medical Web Corpus | English | main | 33,961,786 |
METCLIL: Metaphor in EMI seminars | English | open | 110,493 |
Mongolian Web Texts 2016 (mnWaC16) | Mongolian | trial | 6,104,565 |
Mueller Report | English | trial | 167,103 |
Multicultural London English Corpus | English | main | 2,391,040 |
Nepali National Corpus | Nepali | trial | 13,440,835 |
Nepali Web (NepaliWaC) | Nepali | main | 1,290,388 |
New corpus for English (NCI English) | English | main | 217,548,758 |
New Model Corpus | English | main | 95,276,958 |
Newspapers in Portuguese (CetemPúblico, CetenFolha) | Portuguese | main | 56,768,822 |
Norwegian dictionary corpus (Nynorskkorpuset) | Norwegian | main | 74,496,664 |
Norwegian Web 2012 | Norwegian | main | 669,511,569 |
Norwegian Web 2017 (noTenTen17, Bokmål) | Norwegian Bokmål | trial | 2,472,483,911 |
Norwegian Web 2017 (noTenTen17, Nynorsk) | Norwegian Nynorsk | trial | 174,830,652 |
Norwegian Web 2017 sample (Bokmål) | Norwegian Bokmål | trial | 58,955,519 |
Norwegian Web 2017 sample (Nynorsk) | Norwegian Nynorsk | trial | 58,743,828 |
OEC | English | access on demand | 2,073,319,589 |
OEC v2 | English | access on demand | 2,073,563,928 |
Open Access Journals (DOAJ - English) | English | trial | 2,662,763,697 |
Open American National Corpus (spoken) | English | main | 3,202,026 |
Open American National Corpus (written) | English | main | 11,048,137 |
Open Cambridge Learner Corpus (Uncoded) | English | access on demand | 2,975,701 |
OpenSubtitles 2018 - Afrikaans | Afrikaans | main | 341,349 |
OpenSubtitles 2018 - Albanian | Albanian | main | 15,662,170 |
OpenSubtitles 2018 - Arabic | Arabic | main | 333,329,378 |
OpenSubtitles 2018 - Armenian | Armenian | main | 24,216 |
OpenSubtitles 2018 - Basque | Basque | main | 3,919,829 |
OpenSubtitles 2018 - Bengali | Bengali | main | 2,270,841 |
OpenSubtitles 2018 - Bosnian | Bosnian | main | 125,323,299 |
OpenSubtitles 2018 - Brazilian Portuguese | Portuguese | main | 545,598,189 |
OpenSubtitles 2018 - Breton | Breton | main | 85,503 |
OpenSubtitles 2018 - Bulgarian | Bulgarian | main | 371,685,493 |
OpenSubtitles 2018 - Catalan | Catalan | main | 3,273,561 |
OpenSubtitles 2018 - Chinese Simplified | Chinese Simplified | main | 119,998,854 |
OpenSubtitles 2018 - Chinese Traditional | Chinese Traditional | main | 41,876,166 |
OpenSubtitles 2018 - Croatian | Croatian | main | 370,177,938 |
OpenSubtitles 2018 - Czech | Czech | main | 453,218,524 |
OpenSubtitles 2018 - Danish | Danish | main | 135,228,416 |
OpenSubtitles 2018 - Dutch | Dutch | main | 444,413,064 |
OpenSubtitles 2018 - English | English | main | 1,211,666,401 |
OpenSubtitles 2018 - Esperanto | Esperanto | main | 396,790 |
OpenSubtitles 2018 - Estonian | Estonian | main | 107,391,459 |
OpenSubtitles 2018 - Finnish | Finnish | main | 175,247,181 |
OpenSubtitles 2018 - French | French | main | 462,749,061 |
OpenSubtitles 2018 - Galician | Galician | main | 1,572,312 |
OpenSubtitles 2018 - Georgian | Georgian | main | 1,157,136 |
OpenSubtitles 2018 - German | German | main | 185,133,927 |
OpenSubtitles 2018 - Greek | Greek | main | 457,347,003 |
OpenSubtitles 2018 - Hebrew | Hebrew | main | 371,473,205 |
OpenSubtitles 2018 - Hindi | Hindi | main | 675,322 |
OpenSubtitles 2018 - Hungarian | Hungarian | main | 378,525,740 |
OpenSubtitles 2018 - Icelandic | Icelandic | main | 9,194,074 |
OpenSubtitles 2018 - Indonesian | Indonesian | main | 77,273,767 |
OpenSubtitles 2018 - Italian | Italian | main | 431,415,848 |
OpenSubtitles 2018 - Japanese | Japanese | main | 15,224,480 |
OpenSubtitles 2018 - Kazakh | Kazakh | main | 14,172 |
OpenSubtitles 2018 - Korean | Korean | main | 7,432,927 |
OpenSubtitles 2018 - Latvian | Latvian | main | 2,494,901 |
OpenSubtitles 2018 - Lithuanian | Lithuanian | main | 6,806,857 |
OpenSubtitles 2018 - Macedonian | Macedonian | main | 28,859,153 |
OpenSubtitles 2018 - Malay | Malay | main | 13,465,077 |
OpenSubtitles 2018 - Malayalam | Malayalam | main | 1,671,708 |
OpenSubtitles 2018 - Norwegian | Norwegian | main | 61,215,172 |
OpenSubtitles 2018 - Persian | Persian | main | 53,444,595 |
OpenSubtitles 2018 - Polish | Polish | main | 496,167,686 |
OpenSubtitles 2018 - Portuguese | Portuguese | main | 466,021,603 |
OpenSubtitles 2018 - Romanian | Romanian | main | 658,289,867 |
OpenSubtitles 2018 - Russian | Russian | main | 180,032,832 |
OpenSubtitles 2018 - Serbian | Serbian | main | 480,367,760 |
OpenSubtitles 2018 - Sinhalese | Sinhalese | main | 3,430,727 |
OpenSubtitles 2018 - Slovak | Slovak | main | 66,455,056 |
OpenSubtitles 2018 - Slovenian | Slovenian | main | 198,366,873 |
OpenSubtitles 2018 - Spanish | Spanish | main | 753,235,853 |
OpenSubtitles 2018 - Swedish | Swedish | main | 153,717,474 |
OpenSubtitles 2018 - Tagalog | Tagalog | main | 96,291 |
OpenSubtitles 2018 - Tamil | Tamil | main | 132,055 |
OpenSubtitles 2018 - Telugu | Telugu | main | 109,730 |
OpenSubtitles 2018 - Thai | Thai | main | 33,223,171 |
OpenSubtitles 2018 - Turkish | Turkish | main | 461,809,489 |
OpenSubtitles 2018 - Ukrainian | Ukrainian | main | 5,049,556 |
OpenSubtitles 2018 - Urdu | Urdu | main | 229,947 |
OpenSubtitles 2018 - Vietnamese | Vietnamese | main | 31,848,385 |
Opus MontenegrinSubs: English | English | trial | 468,337 |
Opus MontenegrinSubs: Montenegrin | Montenegrin | trial | 365,698 |
OPUS2 Afrikaans | Afrikaans | main | 586,334 |
OPUS2 Albanian | Albanian | trial | 46,304,346 |
OPUS2 Arabic | Arabic | main | 300,000,057 |
OPUS2 Bosnian | Bosnian | main | 43,582,516 |
OPUS2 Brazilian Portuguese | Portuguese | main | 272,300,927 |
OPUS2 Bulgarian | Bulgarian | main | 183,115,244 |
OPUS2 Chinese Simplified | Chinese Simplified | main | 243,427,123 |
OPUS2 Chinese Traditional | Chinese Traditional | main | 380,245 |
OPUS2 Croatian | Croatian | main | 121,369,625 |
OPUS2 Czech | Czech | main | 203,845,619 |
OPUS2 Danish | Danish | main | 120,107,271 |
OPUS2 Dutch | Dutch | main | 356,363,571 |
OPUS2 English | English | main | 1,139,515,048 |
OPUS2 Estonian | Estonian | main | 64,879,741 |
OPUS2 Finnish | Finnish | main | 131,985,872 |
OPUS2 French | French | main | 766,833,908 |
OPUS2 German | German | main | 125,229,773 |
OPUS2 Greek | Greek | main | 239,360,926 |
OPUS2 Hebrew | Hebrew | main | 130,972,343 |
OPUS2 Hindi | Hindi | main | 854,741 |
OPUS2 Hungarian | Hungarian | main | 157,495,018 |
OPUS2 Italian | Italian | main | 180,532,849 |
OPUS2 Japanese | Japanese | main | 5,455,106 |
OPUS2 Korean | Korean | main | 374,850 |
OPUS2 Latvian | Latvian | main | 24,499,516 |
OPUS2 Lithuanian | Lithuanian | main | 29,621,940 |
OPUS2 Macedonian | Macedonian | trial | 40,348,792 |
OPUS2 Norwegian | Norwegian | main | 20,237,510 |
OPUS2 Persian | Persian | trial | 4,425,133 |
OPUS2 Polish | Polish | main | 208,008,636 |
OPUS2 Portuguese | Portuguese | main | 297,700,205 |
OPUS2 Romanian | Romanian | main | 282,408,295 |
OPUS2 Russian | Russian | main | 307,709,872 |
OPUS2 Serbian | Serbian | main | 153,237,786 |
OPUS2 Slovak | Slovak | main | 62,451,407 |
OPUS2 Slovenian | Slovenian | main | 121,228,966 |
OPUS2 Spanish | Spanish | main | 701,944,027 |
OPUS2 Swedish | Swedish | main | 102,298,686 |
OPUS2 Turkish | Turkish | main | 151,342,424 |
OPUS2 Ukrainian | Ukrainian | main | 2,578,289 |
Oromo Web 2016 (orWaC16) |
Oromo |
trial |
4,249,953 |
Oxford Children's Corpus 2015 (PTag) |
English |
access on demand |
210,322,185 |
Oxford Children's Corpus 2015 -- Education (PTag) |
English |
access on demand |
1,323,174 |
Oxford Children's Corpus 2015 -- Reading (PTag) |
English |
access on demand |
34,284,687 |
Oxford Children's Corpus 2015 -- Writing (PTag) |
English |
access on demand |
174,714,324 |
Oxford Children's Corpus 2016 (PTag) |
English |
access on demand |
284,360,063 |
Oxford Children's Corpus 2016 -- Reading (PTag) |
English |
access on demand |
53,858,955 |
Oxford Children's Corpus 2016 -- Writing (PTag) |
English |
access on demand |
229,177,934 |
Oxford Corpus of Academic English (OCAE, April 2012) |
English |
access on demand |
71,371,739 |
Paisa |
Italian |
main |
221,989,288 |
ParlaMint-BE 2.1 (Belgian parliament) |
French |
main |
30,864,767 |
ParlaMint-BG 2.1 (Bulgarian parliament) |
Bulgarian |
main |
19,096,761 |
ParlaMint-CZ 2.1 (Czech parliament) |
Czech |
main |
22,104,199 |
ParlaMint-DK 2.1 (Danish parliament) |
Danish |
main |
29,205,018 |
ParlaMint-ES 2.1 (Spanish parliament) |
Spanish |
main |
12,930,870 |
ParlaMint-FR 2.1 (French parliament) |
French |
main |
32,176,380 |
ParlaMint-GB 2.1 (British parliament) |
English |
main |
100,967,492 |
ParlaMint-HR 2.1 (Croatian parliament) |
Croatian |
main |
20,342,230 |
ParlaMint-HU 2.1 (Hungarian parliament) |
Hungarian |
main |
856,543 |
ParlaMint-IS 2.1 (Icelandic parliament) |
Icelandic |
main |
23,461,109 |
ParlaMint-IT 2.1 (Italian parliament) |
Italian |
main |
26,571,966 |
ParlaMint-LT 2.1 (Lithuanian parliament) |
Lithuanian |
main |
14,428,682 |
ParlaMint-LV 2.1 (Latvian parliament) |
Latvian |
main |
6,342,984 |
ParlaMint-NL 2.1 (Dutch parliament) |
Dutch |
main |
51,156,406 |
ParlaMint-PL 2.1 (Polish parliament) |
Polish |
main |
26,882,964 |
ParlaMint-SI 2.1 (Slovenian parliament) |
Slovenian |
main |
19,933,836 |
ParlaMint-TR 2.1 (Turkish parliament) |
Turkish |
main |
42,913,306 |
Parsed German Web (sDeWaC) |
German |
main |
755,165,551 |
Penn Corpora of Historical English |
English |
access on demand |
3,800,639 |
PICAE 2010 |
English |
access on demand |
31,025,920 |
Polish Web (PolishWac, Morfeusz and TaKIPI tagger) |
Polish |
main |
103,028,410 |
Polish Web 2012 (plTenTen12, RFTagger) |
Polish |
main |
7,715,835,214 |
Polish Web 2012 sample |
Polish |
trial |
191,648,244 |
Polish Web 2019 (plTenTen19) |
Polish |
trial |
4,253,636,443 |
Portuguese Web 2011 (ptTenTen11) |
Portuguese |
trial |
3,896,392,719 |
Portuguese Web 2011 (ptTenTen11, Palavras parsed) |
Portuguese |
main |
2,757,635,105 |
Portuguese Web 2011 sample |
Portuguese |
trial |
202,548,549 |
Project Gutenberg English |
English |
main |
443,471,071 |
pukWaC (ukWaC parsed with MaltParser) |
English |
main |
39,496,785 |
Quran annotated corpus [unvowelled Arabic] |
Arabic |
main |
128,243 |
Quran annotated corpus [unvowelled Latin] |
Arabic |
main |
99,268 |
Quran annotated corpus [vowelled Arabic] |
Arabic |
main |
128,241 |
Quran annotated corpus [vowelled Latin] |
Arabic |
main |
97,970 |
RapCor1292 - Francophone rap songs |
French |
trial |
710,891 |
Riznica v0.1 |
Croatian |
main |
85,273,724 |
Romanian Web 2016 (roTenTen16) |
Romanian |
trial |
2,640,496,763 |
ruSkELL 1.6 |
Russian |
main |
975,584,449 |
Russian Sites in Estonian Web 2017–2021 |
Russian |
main |
300,702,055 |
Russian Web 2006 (v2 with lempos) |
Russian |
main |
147,930,261 |
Russian Web 2011 (ruTenTen11) |
Russian |
trial |
14,553,856,113 |
Russian Web 2011 sample (ruTenTen11) |
Russian |
trial |
998,099,963 |
Russian Web 2017 (ruTenTen17) |
Russian |
main |
9,034,837,939 |
Samoan Web (SamoanWac1) |
Samoan |
trial |
3,115,385 |
ScienceBlogs |
English |
main |
103,175,233 |
Scottish Gaelic Wiki 2015 (gdWiki) |
Scottish Gaelic |
trial |
980,026 |
Semcor v3.0 (sense-tagged corpus) |
English |
main |
664,038 |
Serbian Web (srWaC 1.2 processed by Hunpos) |
Serbian |
trial |
477,724,164 |
Serbian Web (srWaC 1.2 processed by RFTagger v1) |
Serbian (Latin) |
trial |
441,888,202 |
Serbian Web (srWaC 1.2) |
Serbian (Latin) |
trial |
476,888,297 |
Setswana/Tswana Web (SetswanaWaC v2) |
Setswana |
trial |
11,496,687 |
Slovak Web 2011 (skTenTen11) |
Slovak |
trial |
540,112,634 |
Slovak Web 2011 (skTenTen11, ambiguity tag, lempos) |
Slovak |
main |
715,707,053 |
Slovak Web 2011 sample |
Slovak |
trial |
189,609,195 |
Slovenian reference corpus (FidaPLUS v2) |
Slovenian |
trial |
600,309,637 |
Slovenian Web (slWaC 2.1 processed with TreeTagger v2) |
Slovenian |
trial |
755,255,547 |
Slovenian Web (slWaC 2.1) |
Slovenian |
trial |
754,255,589 |
Slovenian Web 2015 (slTenTen15, TreeTagger v2) |
Slovenian |
trial |
829,544,337 |
Slovenian Web 2015 sample |
Slovenian |
trial |
195,792,821 |
Somali Web 2016 (soWaC16) |
Somali |
trial |
71,871,585 |
SoNaR |
Dutch |
access on demand |
425,978,755 |
Spanish Web 2005 (SpanishWaC) |
Spanish |
main |
97,773,185 |
Spanish Web 2011 (esTenTen11, Eu + Am) |
Spanish |
trial |
9,497,213,009 |
Spanish Web 2011 sample |
Spanish |
trial |
212,142,794 |
Spanish Web 2018 (esTenTen18) |
Spanish |
trial |
16,951,839,897 |
Spanish Web 2018 sample |
Spanish |
trial |
177,257,648 |
Susanne |
English |
trial |
128,998 |
Swahili Web 2014 (swWaC) |
Swahili |
trial |
17,882,483 |
Swedish Web 2014 (svTenTen14) |
Swedish |
trial |
3,401,035,817 |
Swedish Web 2014 sample |
Swedish |
trial |
45,477,881 |
SwedishParole |
Swedish |
main |
21,735,113 |
Tagalog (Filipino) Web 2019 (tlTenTen19) |
Tagalog |
trial |
198,303,250 |
Tajik Web (TajikWaC) |
Tajik |
trial |
93,151,897 |
TalkBank Persian (blog posts) |
Persian |
main |
474,773,547 |
Tamil Web 2015 (TamilWaC) |
Tamil |
trial |
26,750,515 |
Tatar Mixed Corpus |
Tatar |
trial |
102,779,803 |
Tatar News (2000-2014), version with lempos |
Tatar |
main |
24,927,439 |
Tatar Web 2015 sample |
Tatar |
trial |
195,901 |
Ted Talks transcripts |
English |
main |
2,882,085 |
Telugu Web 2017 (teTenTen) |
Telugu |
trial |
126,807,158 |
Thai Web (ThaiWaC) |
Thai |
trial |
82,787,119 |
Thai Web 2018 (thTenTen18) |
Thai |
trial |
640,530,227 |
The Annotated Corpus of Classical Tibetan (ACTib 2.0) |
Tibetan |
trial |
170,202,078 |
The New Corpus for Ireland |
Irish |
main |
29,886,201 |
Tigrinya Web 2016 (tiWaC16) |
Tigrinya |
trial |
2,087,613 |
Timestamped JSI web corpus 2014-2016 Arabic |
Arabic |
trial |
980,016,943 |
Timestamped JSI web corpus 2014-2016 Catalan |
Catalan |
trial |
99,395,494 |
Timestamped JSI web corpus 2014-2016 Czech |
Czech |
trial |
289,488,005 |
Timestamped JSI web corpus 2014-2016 Dutch |
Dutch |
trial |
401,347,934 |
Timestamped JSI web corpus 2014-2016 English |
English |
trial |
18,315,071,361 |
Timestamped JSI web corpus 2014-2016 Finnish |
Finnish |
trial |
119,109,490 |
Timestamped JSI web corpus 2014-2016 French |
French |
trial |
1,870,341,756 |
Timestamped JSI web corpus 2014-2016 German |
German |
trial |
1,987,759,563 |
Timestamped JSI web corpus 2014-2016 Hebrew |
Hebrew |
trial |
111,339,363 |
Timestamped JSI web corpus 2014-2016 Hungarian |
Hungarian |
trial |
180,843,359 |
Timestamped JSI web corpus 2014-2016 Italian |
Italian |
trial |
1,375,907,374 |
Timestamped JSI web corpus 2014-2016 Korean |
Korean |
trial |
438,816,127 |
Timestamped JSI web corpus 2014-2016 Polish |
Polish |
trial |
157,930,228 |
Timestamped JSI web corpus 2014-2016 Portuguese |
Portuguese |
trial |
1,109,771,393 |
Timestamped JSI web corpus 2014-2016 Russian |
Russian |
trial |
1,120,731,416 |
Timestamped JSI web corpus 2014-2016 Serbian |
Serbian |
trial |
86,380,673 |
Timestamped JSI web corpus 2014-2016 Spanish |
Spanish |
trial |
4,055,944,612 |
Timestamped JSI web corpus 2014-2016 Swedish |
Swedish |
trial |
335,782,681 |
Timestamped JSI web corpus 2014-2021 Arabic |
Arabic |
main |
4,901,614,300 |
Timestamped JSI web corpus 2014-2021 Catalan |
Catalan |
main |
449,634,119 |
Timestamped JSI web corpus 2014-2021 Czech |
Czech |
main |
1,031,396,604 |
Timestamped JSI web corpus 2014-2021 Dutch |
Dutch |
main |
1,390,833,141 |
Timestamped JSI web corpus 2014-2021 English |
English |
main |
60,409,480,489 |
Timestamped JSI web corpus 2014-2021 Finnish |
Finnish |
main |
421,879,841 |
Timestamped JSI web corpus 2014-2021 French |
French |
main |
6,998,186,326 |
Timestamped JSI web corpus 2014-2021 German |
German |
main |
7,055,641,455 |
Timestamped JSI web corpus 2014-2021 Hebrew |
Hebrew |
main |
466,851,576 |
Timestamped JSI web corpus 2014-2021 Hungarian |
Hungarian |
main |
903,862,798 |
Timestamped JSI web corpus 2014-2021 Italian |
Italian |
main |
8,730,808,429 |
Timestamped JSI web corpus 2014-2021 Korean |
Korean |
main |
1,576,995,357 |
Timestamped JSI web corpus 2014-2021 Polish |
Polish |
main |
973,863,152 |
Timestamped JSI web corpus 2014-2021 Portuguese |
Portuguese |
main |
4,685,199,909 |
Timestamped JSI web corpus 2014-2021 Russian |
Russian |
main |
5,788,590,952 |
Timestamped JSI web corpus 2014-2021 Serbian |
Serbian |
main |
565,311,513 |
Timestamped JSI web corpus 2014-2021 Spanish |
Spanish |
main |
16,358,148,966 |
Timestamped JSI web corpus 2014-2021 Swedish |
Swedish |
main |
1,162,692,802 |
Timestamped JSI web corpus 2021-01 English |
English |
main |
940,554,284 |
Timestamped JSI web corpus 2021-02 English |
English |
main |
325,034,580 |
Timestamped JSI web corpus 2021-03 Arabic |
Arabic |
main |
104,835,755 |
Timestamped JSI web corpus 2021-03 Catalan |
Catalan |
main |
12,107,597 |
Timestamped JSI web corpus 2021-03 Czech |
Czech |
main |
20,431,801 |
Timestamped JSI web corpus 2021-03 Dutch |
Dutch |
main |
31,428,324 |
Timestamped JSI web corpus 2021-03 English |
English |
main |
988,199,655 |
Timestamped JSI web corpus 2021-03 Finnish |
Finnish |
main |
6,154,402 |
Timestamped JSI web corpus 2021-03 French |
French |
main |
145,384,862 |
Timestamped JSI web corpus 2021-03 German |
German |
main |
126,775,824 |
Timestamped JSI web corpus 2021-03 Hebrew |
Hebrew |
main |
8,450,710 |
Timestamped JSI web corpus 2021-03 Hungarian |
Hungarian |
main |
30,439,114 |
Timestamped JSI web corpus 2021-03 Italian |
Italian |
main |
365,307,999 |
Timestamped JSI web corpus 2021-03 Korean |
Korean |
main |
19,324,576 |
Timestamped JSI web corpus 2021-03 Polish |
Polish |
main |
38,911,481 |
Timestamped JSI web corpus 2021-03 Portuguese |
Portuguese |
main |
108,540,406 |
Timestamped JSI web corpus 2021-03 Russian |
Russian |
main |
150,971,438 |
Timestamped JSI web corpus 2021-03 Serbian |
Serbian |
main |
15,122,285 |
Timestamped JSI web corpus 2021-03 Spanish |
Spanish |
main |
373,185,400 |
Timestamped JSI web corpus 2021-03 Swedish |
Swedish |
main |
22,715,935 |
Timestamped JSI web corpus 2021-04 Arabic |
Arabic |
main |
82,496,710 |
Timestamped JSI web corpus 2021-04 Catalan |
Catalan |
main |
8,926,986 |
Timestamped JSI web corpus 2021-04 Czech |
Czech |
main |
15,095,366 |
Timestamped JSI web corpus 2021-04 Dutch |
Dutch |
main |
23,580,058 |
Timestamped JSI web corpus 2021-04 English |
English |
main |
777,498,417 |
Timestamped JSI web corpus 2021-04 Finnish |
Finnish |
main |
5,624,514 |
Timestamped JSI web corpus 2021-04 French |
French |
main |
113,581,013 |
Timestamped JSI web corpus 2021-04 German |
German |
main |
89,579,085 |
Timestamped JSI web corpus 2021-04 Hebrew |
Hebrew |
main |
6,544,178 |
Timestamped JSI web corpus 2021-04 Hungarian |
Hungarian |
main |
23,392,828 |
Timestamped JSI web corpus 2021-04 Italian |
Italian |
main |
261,813,779 |
Timestamped JSI web corpus 2021-04 Korean |
Korean |
main |
15,506,235 |
Timestamped JSI web corpus 2021-04 Polish |
Polish |
main |
28,676,001 |
Timestamped JSI web corpus 2021-04 Portuguese |
Portuguese |
main |
85,486,841 |
Timestamped JSI web corpus 2021-04 Russian |
Russian |
main |
117,645,204 |
Timestamped JSI web corpus 2021-04 Serbian |
Serbian |
main |
12,237,307 |
Timestamped JSI web corpus 2021-04 Spanish |
Spanish |
main |
289,923,417 |
Timestamped JSI web corpus 2021-04 Swedish |
Swedish |
main |
16,876,787 |
Transhistorical Corpus of Written English |
English |
open |
501,633 |
Turkic web – Azerbaijani |
Azerbaijani |
trial |
94,267,206 |
Turkic web – Kazakh |
Kazakh |
trial |
139,417,763 |
Turkic web – Kyrgyz |
Kyrgyz |
trial |
19,369,507 |
Turkic web – Turkmen |
Turkmen |
trial |
2,105,359 |
Turkic web – Uzbek |
Uzbek |
trial |
18,720,334 |
Turkish Web (trWaC) |
Turkish |
main |
32,791,491 |
Turkish Web 2012 (trTenTen12) |
Turkish |
trial |
3,388,418,900 |
Ukrainian Web 2020 and 2014 (ukTenTen20) |
Ukrainian |
trial |
2,592,516,436 |
UKWaC super sensed |
English |
main |
315,402,632 |
United Nations Parallel Corpus – Arabic |
Arabic |
trial |
545,594,235 |
United Nations Parallel Corpus – Chinese (Simplified) |
Chinese Simplified |
trial |
372,004,482 |
United Nations Parallel Corpus – English |
English |
trial |
664,924,245 |
United Nations Parallel Corpus – French |
French |
trial |
800,980,141 |
United Nations Parallel Corpus – Russian |
Russian |
trial |
529,667,487 |
United Nations Parallel Corpus – Spanish |
Spanish |
trial |
692,809,915 |
Urdu Web (UrduWaC) |
Urdu |
trial |
53,269,273 |
Urdu Web 2018 (urTenTen18) |
Urdu |
trial |
245,656,128 |
Vietnamese Web (VietnameseWaC) |
Vietnamese |
trial |
106,464,835 |
Welsh Web 2013 (WelshWaC) |
Welsh |
trial |
12,458,397 |
Welsh web corpus |
Welsh |
main |
50,392,441 |
Western Frisian Web 2013 (FrisianWaC) |
Frisian |
trial |
3,116,119 |
Yiddish Wikipedia corpus 2018 (yiwiki) |
Yiddish |
trial |
2,106,912 |
Yoruba Web 2015 (YorubaWaC15) |
Yoruba |
trial |
2,816,965 |