| [DEV] Russian Web 2017 (ruTenTen17) | Russian | main | 9,034,837,939 |
| ACL Anthology Reference Corpus (ARC) | English | open | 62,196,334 |
| Afrikaans Wikipedia corpus 2018 (afwiki) | Afrikaans | trial | 14,466,792 |
| American Spanish Web 2011 (esamTenTen11) | Spanish | trial | 7,475,579,365 |
| Amharic Web 2013-17 (amWaC17) | Amharic | trial | 25,975,846 |
| Arabic Learner Corpus (ALC) | Arabic | main | 362,712 |
| Arabic Web 2009 | Arabic | main | 150,282,522 |
| Arabic Web 2012 (arTenTen12, Stanford tagger) | Arabic | trial | 7,475,624,779 |
| Arabic Web 2012 sample 115M (arTenTen12, Mada tagger) | Arabic | main | 115,315,274 |
| Araneum Anglicum Africanum Maius [2015] | English | main | 854,484,093 |
| Araneum Anglicum Asiaticum Maius [2015] | English | main | 867,259,037 |
| Araneum Anglicum Maius [2015] | English | trial | 888,466,066 |
| Araneum Finnicum Maius [2014] | Finnish | main | 817,453,523 |
| Araneum Francogallicum Maius [2015] | French | main | 933,688,995 |
| Araneum Germanicum Maius [2013] | German | main | 875,465,845 |
| Araneum Hispanicum Maius [2013] | Spanish | main | 892,299,770 |
| Araneum Hungaricum Maius [2014] | Hungarian | trial | 792,549,686 |
| Araneum Italicum Maius (Italian, 14.12) 1,20 G | Italian | main | 890,568,533 |
| Araneum Nederlandicum Maius [2013] | Dutch | main | 713,417,518 |
| Araneum Polonicum Maius [2013] | Polish | main | 595,768,667 |
| Araneum Portugallicum Maius [2015] | Portuguese | main | 862,134,902 |
| Araneum Russicum Russicum Maius (Russia-only Russian, 15.03) 1,20 G | Russian | trial | 859,319,823 |
| Araneum Slovacum Maius [2013] | Slovak | trial | 816,125,010 |
| Basque Web (BasqueWaC v2) | Basque | trial | 99,719,584 |
| Belarusian Web 2016 (beTenTen16) | Belarusian | trial | 63,327,264 |
| Bengali Web (bnWaC) | Bengali | trial | 11,519,730 |
| BIBLE Polish-Swahili | Polish | main | 138,216 |
| BIBLE Swahili-Polish | Swahili | main | 139,160 |
| Boot Camp English | English | trial | 85,683,246 |
| Bosnian Web (bsWaC 1.2) | Bosnian | trial | 248,478,730 |
| Brazilian Portuguese corpus (Corpus Brasileiro) | Portuguese | main | 871,117,178 |
| Brexit corpus (English) | English | trial | 108,452,923 |
| Brexit corpus without retweets (English) | English | trial | 4,789,571 |
| British Academic Spoken English Corpus (BASE) | English | open | 1,477,281 |
| British Academic Written English Corpus (BAWE) | English | open | 6,968,089 |
| British Law Report Corpus | English | main | 8,515,749 |
| British National Corpus (BNC) | English | trial | 96,134,547 |
| British National Corpus (BNC) 2014 Spoken | English | trial | 10,495,185 |
| British National Corpus (BNC), tagged by CLAWS | English | trial | 96,052,598 |
| British Web 2007 (ukWaC) | English | main | 1,313,058,436 |
| Brown | English | open | 1,007,299 |
| Brown Family | English | main | 6,963,778 |
| Brown Family, CLAWS + TreeTagger tags | English | main | 6,975,474 |
| Bulgarian National Corpus (BulgarianNC) | Bulgarian | main | 20,975,703 |
| Bulgarian National Corpus nonweb genres | Bulgarian | main | 22,398,507 |
| Bulgarian National Corpus with web | Bulgarian | main | 419,512,059 |
| Bulgarian Web 2012 (bgTenTen12, TreeTagger v2) | Bulgarian | trial | 705,156,683 |
| Cambridge Academic English | English | main | 3,163,648 |
| Cantonese Web (CantoneseWaC) | Cantonese | trial | 30,898,663 |
| Catalan Web 2014 (caTenTen14 v2) | Catalan | trial | 182,691,653 |
| Cebuano Web 2018 (cebTenTen18) | Cebuano | trial | 4,552,105 |
| CHILDES Afrikaans Corpus | Afrikaans | main | 26,020 |
| CHILDES Catalan Corpus | Catalan | main | 209,525 |
| CHILDES Croatian Corpus | Croatian | main | 300,832 |
| CHILDES Danish Corpus | Danish | main | 285,231 |
| CHILDES English Corpus | English | main | 22,693,506 |
| CHILDES Estonian Corpus | Estonian | main | 313,457 |
| CHILDES Farsi Corpus | Persian | main | 120,527 |
| CHILDES French Corpus | French | main | 2,583,460 |
| CHILDES Gaelic Corpus | Irish | main | 16,848 |
| CHILDES German Corpus | German | main | 5,941,266 |
| CHILDES Hebrew Corpus | Hebrew | main | 807,657 |
| CHILDES Hungarian Corpus | Hungarian | main | 247,881 |
| CHILDES Italian Corpus | Italian | main | 459,881 |
| CHILDES Japanese Corpus | Japanese | main | 1,578,068 |
| CHILDES Korean Corpus | Korean | main | 36,056 |
| CHILDES Norwegian Corpus | Norwegian (Mixed) | main | 56,827 |
| CHILDES Polish Corpus | Polish | main | 1,041,300 |
| CHILDES Portuguese Corpus | Portuguese | main | 216,407 |
| CHILDES Russian Corpus | Russian | main | 48,791 |
| CHILDES Spanish Corpus | Spanish | main | 802,743 |
| CHILDES Swedish Corpus | Swedish | main | 520,478 |
| CHILDES Tamil Corpus | Tamil | main | 15,490 |
| CHILDES Thai Corpus | Thai | main | 243,939 |
| CHILDES Turkish Corpus | Turkish | main | 178,100 |
| Chinese GigaWord 2 Corpus: Mainland, simplified | Chinese Simplified | main | 205,031,379 |
| Chinese GigaWord 2 Corpus: Taiwan, traditional | Chinese Traditional | main | 382,600,557 |
| Chinese Simplified Web 2017 sample | Chinese Simplified | trial | 250,361,047 |
| Chinese Traditional Web (TaiwanWaC, Universal Sketch Grammar) | Chinese Traditional | main | 259,156,002 |
| Chinese Traditional Web 2011 (TaiwanWaC) | Chinese Traditional | main | 259,156,002 |
| Chinese Traditional Web 2017 (zhTenTen17) sample | Chinese Traditional | trial | 239,882,651 |
| Chinese Web 2005 (Internet-ZH, NEUCSP tagger) | Chinese Simplified | main | 198,205,344 |
| Chinese Web 2011 (zhTenTen11, sample 10M) | Chinese Simplified | main | 9,012,125 |
| Chinese Web 2011 (zhTenTen11, Stanford tagger) | Chinese Simplified | trial | 1,729,867,455 |
| Chinese Web 2017 (zhTenTen17) Simplified | Chinese Simplified | trial | 13,531,331,169 |
| Chinese Web 2017 (zhTenTen17) Traditional | Chinese Traditional | trial | 2,400,405,372 |
| COMPAS 2015 | English | access on demand | 114,967,191 |
| COMPAS 2016 | English | access on demand | 260,896,404 |
| CoPEP - The Corpus of Portuguese from Academic Journals (v. 1.4) | Portuguese | main | 40,423,011 |
| Corpus Nko ߒߞߏ ߝߊ߬ߘߌ߬ߞߋ߬ߟߋ߲߬ߡߊ | N'Ko | open | 4,102,593 |
| Corpus of Academic Journal Articles (CAJA) | English | access on demand | 79,107,410 |
| Corpus of English Dialogues 1560–1760 | English | access on demand | 1,151,171 |
| Corpus of Estonian Web sentences 2020 | Estonian | main | 280,961,465 |
| Covid-19 | English | open | 224,061,570 |
| Croatian Web (hrWaC 2.2, ReLDI) | Croatian | trial | 1,210,021,198 |
| Croatian Web (hrWaC 2.2, RFTagger) | Croatian | trial | 1,211,328,660 |
| csSkELL v1 (whole documents) | Czech | main | 1,717,516,129 |
| csSkELL v2.2 (sentences with GDEX scores) | Czech | main | 1,443,410,941 |
| Cundeelee Wangka Stories (Cundeelee Wangka) | Cundeelee Wangka | access on demand | 1,965 |
| Cundeelee Wangka Stories (English) | English | access on demand | 4,423 |
| Czech news and web 1995–2002 (czes2.2) | Czech | main | 366,796,757 |
| Czech Web 2017 (csTenTen17) | Czech | trial | 10,502,222,474 |
| Czech Web 2017 sample | Czech | trial | 249,877,322 |
| CzechParl 2012 (v2 with lempos) | Czech | main | 37,184,025 |
| Danish Web 2010 (DanishWaC) | Danish | main | 288,272,967 |
| Danish Web 2014 (daTenTen14) | Danish | main | 2,040,976,501 |
| Danish Web 2017 (daTenTen17) | Danish | trial | 2,170,690,492 |
| Danish Web 2017 sample | Danish | trial | 214,447,970 |
| DGT, Bulgarian | Bulgarian | main | 25,912,721 |
| DGT, Croatian | Croatian | main | 3,968,608 |
| DGT, Czech | Czech | main | 43,621,933 |
| DGT, Danish | Danish | main | 44,962,280 |
| DGT, Dutch | Dutch | main | 50,523,892 |
| DGT, English | English | main | 59,106,576 |
| DGT, Estonian | Estonian | main | 34,155,488 |
| DGT, Finnish | Finnish | main | 35,129,923 |
| DGT, French | French | main | 58,224,781 |
| DGT, German | German | main | 45,380,666 |
| DGT, Greek | Greek | main | 51,865,988 |
| DGT, Hungarian | Hungarian | main | 2,306,272 |
| DGT, Irish | Irish | main | 1,065,421 |
| DGT, Italian | Italian | main | 53,260,912 |
| DGT, Latvian | Latvian | main | 38,898,134 |
| DGT, Lithuanian | Lithuanian | main | 38,675,242 |
| DGT, Maltese | Maltese | main | 22,388,562 |
| DGT, Polish | Polish | main | 44,149,107 |
| DGT, Portuguese | Portuguese | main | 53,950,705 |
| DGT, Romanian | Romanian | main | 26,644,734 |
| DGT, Slovak | Slovak | main | 43,276,048 |
| DGT, Slovenian | Slovenian | main | 42,897,385 |
| DGT, Spanish | Spanish | main | 57,311,149 |
| DGT, Swedish | Swedish | main | 44,378,725 |
| Dutch Web 2014 (nlTenTen14) | Dutch | trial | 2,253,777,579 |
| Dutch Web 2014 sample | Dutch | trial | 250,219,005 |
| e-flux (International art English) | English | main | 5,036,119 |
| EcoLexicon English (Environment) | English | open | 23,169,446 |
| English Broadsheet Newspapers 1993–2013 (SiBol with trends) | English | main | 654,435,535 |
| English Corpus for SkELL 3.10 | English | main | 1,038,200,313 |
| English Corpus for SkELL 3.8 | English | main | 1,041,772,774 |
| English Corpus for SkELL 3.9 | English | main | 1,041,138,575 |
| English Historical Book Collection (EEBO, ECCO, Evans) | English | main | 826,296,048 |
| English Preposition Corpus | English | trial | 2,136,325 |
| English Web 2008 (enTenTen08) | English | main | 2,759,340,513 |
| English Web 2012 (enTenTen12) | English | main | 11,191,860,036 |
| English Web 2013 (enTenTen13) | English | trial | 19,685,733,337 |
| English Web 2013 sample | English | trial | 204,976,089 |
| English Web 2015 (enTenTen15) | English | trial | 13,190,556,334 |
| English Wikipedia | English | main | 1,356,523,079 |
| English Wikipedia sample with Error annotations | English | trial | 951,824 |
| Estonian Corpus for Learners 2020 (etSkELL) | Estonian | main | 280,572,215 |
| Estonian National Corpus 2019 (Estonian NC 2019) | Estonian | trial | 1,500,284,681 |
| Estonian Reference corpus 1990-2008 (EstonianRC) | Estonian | main | 203,267,951 |
| Estonian Web 2013 (etTenTen13) | Estonian | trial | 260,559,829 |
| EUR-Lex Bulgarian 2/2016 | Bulgarian | trial | 329,071,554 |
| EUR-Lex Croatian 2/2016 | Croatian | trial | 109,138,184 |
| EUR-Lex Czech 2/2016 | Czech | trial | 350,230,088 |
| EUR-Lex Danish 2/2016 | Danish | trial | 519,765,085 |
| EUR-Lex Dutch 2/2016 | Dutch | trial | 583,263,688 |
| EUR-Lex English 2/2016 | English | trial | 629,722,593 |
| EUR-Lex Estonian 2/2016 | Estonian | trial | 291,077,511 |
| EUR-Lex Finnish 2/2016 | Finnish | trial | 384,119,975 |
| EUR-Lex French 2/2016 | French | trial | 677,063,993 |
| EUR-Lex German 2/2016 | German | trial | 528,617,843 |
| EUR-Lex Greek 2/2016 | Greek | trial | 579,344,223 |
| EUR-Lex Hungarian 2/2016 | Hungarian | trial | 340,618,970 |
| EUR-Lex Irish 2/2016 | Irish | trial | 31,439,542 |
| EUR-Lex Italian 2/2016 | Italian | trial | 606,070,097 |
| EUR-Lex judgments Bulgarian 12/2016 | Bulgarian | trial | 17,071,495 |
| EUR-Lex judgments Croatian 12/2016 | Croatian | trial | 5,613,468 |
| EUR-Lex judgments Czech 12/2016 | Czech | trial | 18,226,505 |
| EUR-Lex judgments Danish 12/2016 | Danish | trial | 34,934,021 |
| EUR-Lex judgments Dutch 12/2016 | Dutch | trial | 40,534,071 |
| EUR-Lex judgments English 12/2016 | English | trial | 42,339,337 |
| EUR-Lex judgments Estonian 12/2016 | Estonian | trial | 15,029,608 |
| EUR-Lex judgments Finnish 12/2016 | Finnish | trial | 23,601,422 |
| EUR-Lex judgments French 12/2016 | French | trial | 48,023,524 |
| EUR-Lex judgments German 12/2016 | German | trial | 35,297,517 |
| EUR-Lex judgments Greek 12/2016 | Greek | trial | 35,815,108 |
| EUR-Lex judgments Hungarian 12/2016 | Hungarian | trial | 17,940,879 |
| EUR-Lex judgments Italian 12/2016 | Italian | trial | 42,053,315 |
| EUR-Lex judgments Latvian 12/2016 | Latvian | trial | 16,908,831 |
| EUR-Lex judgments Lithuanian 12/2016 | Lithuanian | trial | 16,252,111 |
| EUR-Lex judgments Maltese 12/2016 | Maltese | trial | 19,146,797 |
| EUR-Lex judgments Polish 12/2016 | Polish | trial | 18,799,551 |
| EUR-Lex judgments Portuguese 12/2016 | Portuguese | trial | 35,412,936 |
| EUR-Lex judgments Romanian 12/2016 | Romanian | trial | 17,592,388 |
| EUR-Lex judgments Slovak 12/2016 | Slovak | trial | 18,265,664 |
| EUR-Lex judgments Slovenian 12/2016 | Slovenian | trial | 18,439,766 |
| EUR-Lex judgments Spanish 12/2016 | Spanish | trial | 39,431,836 |
| EUR-Lex judgments Swedish 12/2016 | Swedish | trial | 30,666,764 |
| EUR-Lex Latvian 2/2016 | Latvian | trial | 324,734,544 |
| EUR-Lex Lithuanian 2/2016 | Lithuanian | trial | 323,151,426 |
| EUR-Lex Maltese 2/2016 | Maltese | trial | 314,396,006 |
| EUR-Lex Polish 2/2016 | Polish | trial | 360,862,149 |
| EUR-Lex Portuguese 2/2016 | Portuguese | trial | 595,066,701 |
| EUR-Lex Romanian 2/2016 | Romanian | trial | 336,928,068 |
| EUR-Lex Slovak 2/2016 | Slovak | trial | 255,531,673 |
| EUR-Lex Slovenian 2/2016 | Slovenian | trial | 351,899,258 |
| EUR-Lex Spanish 2/2016 | Spanish | trial | 635,187,126 |
| EUR-Lex Swedish 2/2016 | Swedish | trial | 478,485,126 |
| EUROPARL7, Bulgarian | Bulgarian | trial | 9,215,233 |
| EUROPARL7, Czech | Czech | trial | 13,013,774 |
| EUROPARL7, Danish | Danish | trial | 48,343,860 |
| EUROPARL7, Dutch | Dutch | trial | 54,007,722 |
| EUROPARL7, English | English | trial | 53,837,625 |
| EUROPARL7, Estonian | Estonian | trial | 11,171,727 |
| EUROPARL7, Finnish | Finnish | trial | 34,182,031 |
| EUROPARL7, French | French | trial | 59,145,988 |
| EUROPARL7, German | German | trial | 47,805,055 |
| EUROPARL7, Greek | Greek | trial | 38,868,863 |
| EUROPARL7, Hungarian | Hungarian | trial | 12,421,715 |
| EUROPARL7, Italian | Italian | trial | 52,871,060 |
| EUROPARL7, Latvian | Latvian | trial | 11,920,085 |
| EUROPARL7, Lithuanian | Lithuanian | trial | 11,424,032 |
| EUROPARL7, Polish | Polish | trial | 13,034,164 |
| EUROPARL7, Portuguese | Portuguese | trial | 53,778,766 |
| EUROPARL7, Romanian | Romanian | trial | 9,554,864 |
| EUROPARL7, Slovak | Slovak | trial | 12,942,651 |
| EUROPARL7, Slovenian | Slovenian | trial | 12,496,942 |
| EUROPARL7, Spanish | Spanish | trial | 54,302,284 |
| EUROPARL7, Swedish | Swedish | trial | 46,303,799 |
| European Spanish Web 2011 (eseuTenTen11) | Spanish | trial | 2,021,633,644 |
| Finnish Web 2014 (fiTenTen14) | Finnish | trial | 1,404,083,812 |
| Finnish Web 2014 (fiTenTen14, TreeTagger v2) | Finnish | main | 1,404,100,049 |
| Finnish Web 2014 sample (fiTenTen14, TreeTagger v2) | Finnish | trial | 40,756,118 |
| Frantext (French literature of the 18th-20th century) | French | main | 15,573,070 |
| Frantext (French literature of the 18th-20th century), without trends | French | main | 15,573,070 |
| French corpus of 88,000 SMS (88milSMS) | French | trial | 1,206,663 |
| French Web 2008 (v2 with lempos) | French | main | 104,705,211 |
| French Web 2010 (frWaC) | French | main | 1,330,564,200 |
| French Web 2012 (frTenTen12) | French | trial | 9,889,689,889 |
| French Web 2012 sample | French | trial | 205,185,797 |
| French Web 2017 (frTenTen17) | French | trial | 5,752,261,039 |
| French Web 2017 sample | French | trial | 404,555,405 |
| Georgian Web 2013 (kaWaC) | Georgian | trial | 50,713,604 |
| German Corpus for SkELL 1.0 | German | main | 769,810,745 |
| German Political Speeches Corpus | German | trial | 11,144,258 |
| German Web 2010 | German | main | 2,338,036,362 |
| German Web 2010 (deWaC) | German | main | 1,348,188,416 |
| German Web 2013 (deTenTen13) | German | trial | 16,526,335,416 |
| German Web 2013 sample | German | trial | 193,838,751 |
| GerManC (German Newspapers 1650-1800) | German | main | 667,310 |
| Gigafida v2.0 (referenčni) | Slovenian | main | 1,109,441,592 |
| Greek Web (GkWaC with lempos) | Greek | main | 124,285,612 |
| Greek Web 2014 (elTenTen14) | Greek | trial | 1,671,692,845 |
| Guangwai - Lancaster Chinese Learner Corpus | Chinese Simplified | open | 1,289,060 |
| Gujarati Web (guWaC) | Gujarati | trial | 17,960,095 |
| Hausa Web 2015 (hausaWaC15) | Hausa (Boko) | trial | 5,304,300 |
| Hebrew General Corpus (web crawled, mostly newspapers) | Hebrew | main | 157,947,728 |
| Hebrew Web (HebWaC) | Hebrew | main | 47,832,254 |
| Hebrew Web 2014 (heTenTen14, Meni/Alon tagged + lempos) | Hebrew | access on demand | 895,876,116 |
| Hebrew Web 2014 (heTenTen14, no POS tagging) | Hebrew | trial | 890,282,843 |
| Hindi Web 2012 (HindiWaC v. 4) | Hindi | trial | 107,960,109 |
| Hindi Web 2013 (hiTenTen13) | Hindi | main | 351,289,441 |
| Hungarian Web 2012 (huTenTen12) | Hungarian | trial | 2,572,620,694 |
| Icelandic texts [sample] | Icelandic | trial | 5,436,035 |
| Igbo Web 2015 (IgboWaC15) | Igbo | trial | 331,042 |
| Indonesian Web (IndonesianWaC) | Indonesian | trial | 90,120,046 |
| Irish Syllabic Poetry, circa 1200-1650 (BARDIC@TCD) | Irish | open | 314,807 |
| Italian Corpus for SkELL 1.0 | Italian | main | 328,270,600 |
| Italian Web 2006 (itWaC) | Italian | main | 1,597,295,469 |
| Italian Web 2010 (itTenTen) | Italian | main | 2,588,873,046 |
| Italian Web 2016 (itTenTen16) | Italian | trial | 4,989,729,171 |
| Italian Web 2016 sample | Italian | trial | 201,204,942 |
| itWAC (reduced) | Italian | main | 751,542,948 |
| Japanese Web 2006 (jpWaC) | Japanese | main | 336,867,039 |
| Japanese Web 2011 (jaTenTen11) | Japanese | trial | 8,432,256,578 |
| Japanese Web 2011 (jaTenTen11, sample) | Japanese | main | 301,407,652 |
| Japanese Web 2011 sample (jaTenTen11, LUW) | Japanese | trial | 163,837,671 |
| Kannada Web 2012 (knWaC12) | Kannada | trial | 11,056,526 |
| KAS-Dipl (diplome) | Slovenian | main | 568,188,810 |
| KAS-Dr (doktorati) | Slovenian | main | 30,244,519 |
| KAS-Mag (magisteriji) | Slovenian | main | 157,168,378 |
| Khmer Web 2018 (kmTenTen18) | Khmer | trial | 16,500,379 |
| Korean 2018 term reference corpus (koTenTen18_term_ref) | Korean | trial | 83,749,660 |
| Korean Web 2012 (koTenTen12) | Korean | main | 461,196,240 |
| Korean Web 2018 (koTenTen18) | Korean | trial | 1,668,851,720 |
| KSUCCA (Classical Arabic) | Arabic | main | 46,705,577 |
| Lao Web 2018 (loTenTen18) | Lao | trial | 15,862,991 |
| Lao Web 2019 (loTenTen19) | Lao | trial | 105,018,584 |
| LatinISE corpus | Latin | trial | 11,202,216 |
| Latvian Web (LatvianWaC) | Latvian | main | 57,666,024 |
| Latvian Web 2014 (lvTenTen14) | Latvian | trial | 530,367,474 |
| Lektor (Learner corpus of proofread and translations) | Slovenian | main | 953,038 |
| LEXMCI | English | main | 1,448,180,339 |
| Lithuanian Web (LithuanianWaC v2) | Lithuanian | main | 48,650,918 |
| Lithuanian Web 2014 (ltTenTen14) | Lithuanian | trial | 778,151,979 |
| MagyarOK teaching materials for Hungarian, levels A1 to B2 | Hungarian | open | 144,832 |
| Malayalam Web (malayalamWaC) | Malayalam | trial | 15,950,663 |
| Malaysian Web (MalaysianWaC) | Malay | trial | 182,578,743 |
| Maldivian Wikipedia corpus 2019 (dvwiki) | Maldivian | trial | 548,211 |
| Maltese MLRS Corpus | Maltese | trial | 110,714,844 |
| Maori Web 2013 and 2020 (miTenTen20) | Maori | trial | 11,814,825 |
| Medical Web Corpus | English | main | 33,961,786 |
| Mongolian Web Texts 2016 (mnWaC16) | Mongolian | trial | 6,104,565 |
| Multicultural London English Corpus | English | main | 2,391,040 |
| Nepali National Corpus | Nepali | trial | 13,440,835 |
| Nepali Web (NepaliWaC) | Nepali | main | 1,290,388 |
| New corpus for English (NCI English) | English | main | 217,548,758 |
| New Model Corpus | English | main | 95,276,958 |
| Newspapers in Portuguese (CetemPúblico, CetenFolha) | Portuguese | main | 56,768,822 |
| Norwegian dictionary corpus (Nynorskkorpuset) | Norwegian (Mixed) | main | 74,496,664 |
| Norwegian Web 2012 | Norwegian (Mixed) | main | 669,511,569 |
| Norwegian Web 2017 (noTenTen17, Bokmål) | Norwegian Bokmål | trial | 2,472,483,911 |
| Norwegian Web 2017 (noTenTen17, Nynorsk) | Norwegian Nynorsk | trial | 174,830,652 |
| Norwegian Web 2017 sample (Bokmål) | Norwegian Bokmål | trial | 58,955,519 |
| Norwegian Web 2017 sample (Nynorsk) | Norwegian Nynorsk | trial | 58,743,828 |
| OEC | English | access on demand | 2,073,319,589 |
| OEC v2 | English | access on demand | 2,073,563,928 |
| Open Access Journals (DOAJ - English) | English | trial | 2,662,763,697 |
| Open American National Corpus (spoken) | English | main | 3,202,026 |
| Open American National Corpus (written) | English | main | 11,048,137 |
| Open Cambridge Learner Corpus (Uncoded) | English | access on demand | 2,975,701 |
| Opus MontenegrinSubs: English | English | trial | 468,337 |
| Opus MontenegrinSubs: Montenegrin | Montenegrin | trial | 365,698 |
| OPUS2 Afrikaans | Afrikaans | main | 586,334 |
| OPUS2 Albanian | Albanian | trial | 46,304,346 |
| OPUS2 Arabic | Arabic | main | 300,000,057 |
| OPUS2 Bosnian | Bosnian | main | 43,582,516 |
| OPUS2 Brazilian Portuguese | Portuguese | main | 272,300,927 |
| OPUS2 Bulgarian | Bulgarian | main | 183,115,244 |
| OPUS2 Chinese Simplified | Chinese Simplified | main | 243,427,123 |
| OPUS2 Chinese Traditional | Chinese Traditional | main | 380,245 |
| OPUS2 Croatian | Croatian | main | 121,369,625 |
| OPUS2 Czech | Czech | main | 203,845,619 |
| OPUS2 Danish | Danish | main | 120,107,271 |
| OPUS2 Dutch | Dutch | main | 356,363,571 |
| OPUS2 English | English | main | 1,139,515,048 |
| OPUS2 Estonian | Estonian | main | 64,879,741 |
| OPUS2 Finnish | Finnish | main | 131,985,872 |
| OPUS2 French | French | main | 766,833,908 |
| OPUS2 German | German | main | 125,229,773 |
| OPUS2 Greek | Greek | main | 239,360,926 |
| OPUS2 Hebrew | Hebrew | main | 130,972,343 |
| OPUS2 Hindi | Hindi | main | 854,741 |
| OPUS2 Hungarian | Hungarian | main | 157,495,018 |
| OPUS2 Italian | Italian | main | 180,532,849 |
| OPUS2 Japanese | Japanese | main | 5,455,106 |
| OPUS2 Korean | Korean | main | 374,850 |
| OPUS2 Latvian | Latvian | main | 24,499,516 |
| OPUS2 Lithuanian | Lithuanian | main | 29,621,940 |
| OPUS2 Macedonian | Macedonian | trial | 40,348,792 |
| OPUS2 Norwegian | Norwegian (Mixed) | main | 20,237,510 |
| OPUS2 Persian | Persian | trial | 4,425,133 |
| OPUS2 Polish | Polish | main | 208,008,636 |
| OPUS2 Portuguese | Portuguese | main | 297,700,205 |
| OPUS2 Romanian | Romanian | main | 282,408,295 |
| OPUS2 Russian | Russian | main | 307,709,872 |
| OPUS2 Serbian | Serbian | main | 153,237,786 |
| OPUS2 Slovak | Slovak | main | 62,451,407 |
| OPUS2 Slovenian | Slovenian | main | 121,228,966 |
| OPUS2 Spanish | Spanish | main | 111,497 |
| OPUS2 Swedish | Swedish | main | 102,298,686 |
| OPUS2 Turkish | Turkish | main | 151,342,424 |
| OPUS2 Ukrainian | Ukrainian | main | 2,578,289 |
| Oromo Web 2016 (orWaC16) | Oromo | trial | 4,249,953 |
| Oxford Children's Corpus 2015 (PTag) | English | access on demand | 210,322,185 |
| Oxford Children's Corpus 2015 -- Education (PTag) | English | access on demand | 1,323,174 |
| Oxford Children's Corpus 2015 -- Reading (PTag) | English | access on demand | 34,284,687 |
| Oxford Children's Corpus 2015 -- Writing (PTag) | English | access on demand | 174,714,324 |
| Oxford Children's Corpus 2016 (PTag) | English | access on demand | 284,360,063 |
| Oxford Children's Corpus 2016 -- Reading (PTag) | English | access on demand | 53,858,955 |
| Oxford Children's Corpus 2016 -- Writing (PTag) | English | access on demand | 229,177,934 |
| Oxford Corpus of Academic English (April 2012) | English | access on demand | 71,372,972 |
| Paisa | Italian | main | 221,989,288 |
| Parsed German Web (sDeWaC) | German | main | 755,165,551 |
| Penn Corpora of Historical English | English | access on demand | 3,800,639 |
| PICAE 2010 | English | access on demand | 31,025,920 |
| Polish Web (PolishWac, Morfeusz and TaKIPI tagger) | Polish | main | 103,028,410 |
| Polish Web 2012 (plTenTen12, RFTagger) | Polish | trial | 7,715,835,214 |
| Polish Web 2012 sample | Polish | trial | 191,648,244 |
| Portuguese Web 2011 (ptTenTen11) | Portuguese | trial | 3,896,392,719 |
| Portuguese Web 2011 (ptTenTen11, Palavras parsed) | Portuguese | main | 2,757,635,105 |
| Portuguese Web 2011 sample | Portuguese | trial | 202,548,549 |
| Project Gutenberg English | English | main | 443,471,071 |
| pukWaC (ukWaC parsed with MaltParser) | English | main | 39,502,648 |
| Quran annotated corpus [unvowelled Arabic] | Arabic | main | 128,243 |
| Quran annotated corpus [unvowelled Latin] | Arabic | main | 99,268 |
| Quran annotated corpus [vowelled Arabic] | Arabic | main | 128,241 |
| Quran annotated corpus [vowelled Latin] | Arabic | main | 97,970 |
| RapCor1288 - Francophone rap songs | French | trial | 709,057 |
| Riznica v0.1 | Croatian | main | 85,273,724 |
| Romanian Web 2016 (roTenTen16) | Romanian | trial | 2,640,496,763 |
| ruSkELL 1.6 | Russian | main | 975,584,449 |
| Russian Web 2006 (v2 with lempos) | Russian | main | 147,930,261 |
| Russian Web 2011 (ruTenTen11) | Russian | trial | 14,553,856,113 |
| Russian Web 2011 sample (ruTenTen11) | Russian | trial | 998,099,963 |
| Samoan Web (SamoanWac1) | Samoan | trial | 3,115,385 |
| ScienceBlogs | English | main | 103,175,233 |
| Scottish Gaelic Wiki 2015 (gdWiki) | Scottish Gaelic | trial | 980,026 |
| Semcor v3.0 (sense-tagged corpus) | English | main | 664,038 |
| Serbian Web (srWaC 1.2 processed by Hunpos) | Serbian | trial | 477,724,164 |
| Serbian Web (srWaC 1.2 processed by RFTagger v1) | Serbian (Latin) | trial | 441,888,202 |
| Serbian Web (srWaC 1.2) | Serbian (Latin) | trial | 476,888,297 |
| Setswana/Tswana Web (SetswanaWaC v2) | Setswana | trial | 11,496,687 |
| Slovak Web 2011 (skTenTen11) | Slovak | trial | 540,112,634 |
| Slovak Web 2011 (skTenTen11, ambiguity tag, lempos) | Slovak | main | 715,707,053 |
| Slovak Web 2011 sample | Slovak | trial | 189,609,195 |
| Slovenian reference corpus (FidaPLUS v2) | Slovenian | trial | 600,309,670 |
| Slovenian Web (slWaC 2.1 processed with TreeTagger v2) | Slovenian | trial | 755,255,547 |
| Slovenian Web (slWaC 2.1) | Slovenian | trial | 754,255,589 |
| Slovenian Web 2015 (slTenTen15, TreeTagger v2) | Slovenian | trial | 829,544,337 |
| Slovenian Web 2015 sample | Slovenian | trial | 195,792,821 |
| Somali Web 2016 (soWaC16) | Somali | trial | 71,871,585 |
| SoNaR | Dutch | access on demand | 425,978,755 |
| Spanish Web 2005 (SpanishWaC) | Spanish | main | 97,773,185 |
| Spanish Web 2011 (esTenTen11, Eu + Am) | Spanish | trial | 9,497,213,009 |
| Spanish Web 2011 sample | Spanish | trial | 212,142,794 |
| Spanish Web 2018 (esTenTen18) | Spanish | trial | 17,553,075,259 |
| Spanish Web 2018 sample | Spanish | trial | 177,257,648 |
| Susanne | English | trial | 128,998 |
| Swahili Web 2014 (SwahiliWaC) | Swahili | trial | 17,882,483 |
| Swedish Web 2014 (svTenTen14) | Swedish | trial | 3,401,035,817 |
| Swedish Web 2014 sample | Swedish | trial | 45,477,881 |
| SwedishParole | Swedish | main | 21,735,113 |
| Tagalog (Filipino) Web 2019 (tlTenTen19) | Tagalog | trial | 197,908,842 |
| Tajik Web (TajikWaC) | Tajik | trial | 93,151,897 |
| TalkBank Persian (blog posts) | Persian | main | 474,773,547 |
| Tamil Web 2015 (TamilWaC) | Tamil | trial | 26,750,515 |
| Tatar Mixed Corpus | Tatar | trial | 102,779,803 |
| Tatar News (2000-2014), version with lempos | Tatar | main | 24,927,439 |
| Tatar Web 2015 sample | Tatar | trial | 195,901 |
| Ted Talks transcripts | English | main | 2,882,085 |
| Telugu Web 2017 (teTenTen) | Telugu | trial | 126,807,158 |
| Thai Web (ThaiWaC) | Thai | trial | 82,787,119 |
| Thai Web 2018 (thTenTen18) | Thai | trial | 640,530,227 |
| The Annotated Corpus of Classical Tibetan (ACTib 2.0) | Tibetan | trial | 170,202,078 |
| The New Corpus for Ireland | Irish | main | 29,886,201 |
| Tigrinya Web 2016 (tiWaC16) | Tigrinya | trial | 2,087,613 |
| Timestamped JSI web corpus 2014-2016 Arabic | Arabic | trial | 976,573,611 |
| Timestamped JSI web corpus 2014-2016 Catalan | Catalan | trial | 99,395,494 |
| Timestamped JSI web corpus 2014-2016 Czech | Czech | trial | 289,488,005 |
| Timestamped JSI web corpus 2014-2016 Dutch | Dutch | trial | 401,347,934 |
| Timestamped JSI web corpus 2014-2016 English | English | trial | 18,315,071,361 |
| Timestamped JSI web corpus 2014-2016 Finnish | Finnish | trial | 119,109,490 |
| Timestamped JSI web corpus 2014-2016 French | French | trial | 1,870,341,756 |
| Timestamped JSI web corpus 2014-2016 German | German | trial | 1,987,759,563 |
| Timestamped JSI web corpus 2014-2016 Hebrew | Hebrew | trial | 111,339,363 |
| Timestamped JSI web corpus 2014-2016 Hungarian | Hungarian | trial | 180,843,359 |
| Timestamped JSI web corpus 2014-2016 Italian | Italian | trial | 1,375,907,374 |
| Timestamped JSI web corpus 2014-2016 Korean | Korean | trial | 438,816,127 |
| Timestamped JSI web corpus 2014-2016 Polish | Polish | trial | 157,930,228 |
| Timestamped JSI web corpus 2014-2016 Portuguese | Portuguese | trial | 1,109,771,393 |
| Timestamped JSI web corpus 2014-2016 Russian | Russian | trial | 1,120,731,416 |
| Timestamped JSI web corpus 2014-2016 Serbian | Serbian | trial | 86,380,673 |
| Timestamped JSI web corpus 2014-2016 Spanish | Spanish | trial | 4,055,944,612 |
| Timestamped JSI web corpus 2014-2016 Swedish | Swedish | trial | 335,782,681 |
| Timestamped JSI web corpus 2014-2020 Arabic | Arabic | main | 4,121,147,715 |
| Timestamped JSI web corpus 2014-2020 Catalan | Catalan | main | 373,235,642 |
| Timestamped JSI web corpus 2014-2020 Czech | Czech | main | 901,794,639 |
| Timestamped JSI web corpus 2014-2020 Dutch | Dutch | main | 1,181,836,141 |
| Timestamped JSI web corpus 2014-2020 English | English | main | 53,106,755,084 |
| Timestamped JSI web corpus 2014-2020 Finnish | Finnish | main | 369,454,982 |
| Timestamped JSI web corpus 2014-2020 French | French | main | 5,982,741,890 |
| Timestamped JSI web corpus 2014-2020 German | German | main | 6,194,176,109 |
| Timestamped JSI web corpus 2014-2020 Hebrew | Hebrew | main | 406,351,360 |
| Timestamped JSI web corpus 2014-2020 Hungarian | Hungarian | main | 714,951,341 |
| Timestamped JSI web corpus 2014-2020 Italian | Italian | main | 6,509,458,717 |
| Timestamped JSI web corpus 2014-2020 Korean | Korean | main | 1,438,494,218 |
| Timestamped JSI web corpus 2014-2020 Polish | Polish | main | 729,292,544 |
| Timestamped JSI web corpus 2014-2020 Portuguese | Portuguese | main | 3,957,241,843 |
| Timestamped JSI web corpus 2014-2020 Russian | Russian | main | 4,791,961,483 |
| Timestamped JSI web corpus 2014-2020 Serbian | Serbian | main | 466,051,344 |
| Timestamped JSI web corpus 2014-2020 Spanish | Spanish | main | 13,834,261,153 |
| Timestamped JSI web corpus 2014-2020 Swedish | Swedish | main | 1,007,079,426 |
| Timestamped JSI web corpus 2020-09 Arabic | Arabic | main | 93,839,059 |
| Timestamped JSI web corpus 2020-09 Catalan | Catalan | main | 9,114,479 |
| Timestamped JSI web corpus 2020-09 Czech | Czech | main | 16,500,590 |
| Timestamped JSI web corpus 2020-09 Dutch | Dutch | main | 27,350,237 |
| Timestamped JSI web corpus 2020-09 English | English | main | 944,265,733 |
| Timestamped JSI web corpus 2020-09 Finnish | Finnish | main | 7,165,935 |
| Timestamped JSI web corpus 2020-09 French | French | main | 133,128,037 |
| Timestamped JSI web corpus 2020-09 German | German | main | 119,113,152 |
| Timestamped JSI web corpus 2020-09 Hebrew | Hebrew | main | 7,962,757 |
| Timestamped JSI web corpus 2020-09 Hungarian | Hungarian | main | 21,325,758 |
| Timestamped JSI web corpus 2020-09 Italian | Italian | main | 251,646,734 |
| Timestamped JSI web corpus 2020-09 Korean | Korean | main | 19,413,863 |
| Timestamped JSI web corpus 2020-09 Polish | Polish | main | 29,946,442 |
| Timestamped JSI web corpus 2020-09 Portuguese | Portuguese | main | 96,906,119 |
| Timestamped JSI web corpus 2020-09 Russian | Russian | main | 133,493,258 |
| Timestamped JSI web corpus 2020-09 Serbian | Serbian | main | 12,175,985 |
| Timestamped JSI web corpus 2020-09 Spanish | Spanish | main | 325,029,575 |
| Timestamped JSI web corpus 2020-09 Swedish | Swedish | main | 20,294,470 |
| Timestamped JSI web corpus 2020-10 Arabic | Arabic | main | 96,538,837 |
| Timestamped JSI web corpus 2020-10 Catalan | Catalan | main | 9,685,481 |
| Timestamped JSI web corpus 2020-10 Czech | Czech | main | 17,378,113 |
| Timestamped JSI web corpus 2020-10 Dutch | Dutch | main | 30,202,034 |
| Timestamped JSI web corpus 2020-10 English | English | main | 986,590,708 |
| Timestamped JSI web corpus 2020-10 Finnish | Finnish | main | 7,660,361 |
| Timestamped JSI web corpus 2020-10 French | French | main | 138,015,892 |
| Timestamped JSI web corpus 2020-10 German | German | main | 127,987,516 |
| Timestamped JSI web corpus 2020-10 Hebrew | Hebrew | main | 8,401,215 |
| Timestamped JSI web corpus 2020-10 Hungarian | Hungarian | main | 22,408,596 |
| Timestamped JSI web corpus 2020-10 Italian | Italian | main | 259,816,566 |
| Timestamped JSI web corpus 2020-10 Korean | Korean | main | 19,346,769 |
| Timestamped JSI web corpus 2020-10 Polish | Polish | main | 32,034,885 |
| Timestamped JSI web corpus 2020-10 Portuguese | Portuguese | main | 101,374,205 |
| Timestamped JSI web corpus 2020-10 Russian | Russian | main | 138,972,026 |
| Timestamped JSI web corpus 2020-10 Serbian | Serbian | main | 13,713,045 |
| Timestamped JSI web corpus 2020-10 Spanish | Spanish | main | 340,052,637 |
| Timestamped JSI web corpus 2020-10 Swedish | Swedish | main | 21,327,238 |
| Turkic web – Azerbaijani | Azerbaijani | trial | 94,267,206 |
| Turkic web – Kazakh | Kazakh | trial | 139,417,763 |
| Turkic web – Kyrgyz | Kyrgyz | trial | 19,369,507 |
| Turkic web – Turkmen | Turkmen | trial | 2,105,359 |
| Turkic web – Uzbek | Uzbek | trial | 18,720,334 |
| Turkish Web (trWaC) | Turkish | main | 32,791,491 |
| Turkish Web 2012 (trTenTen12) | Turkish | trial | 3,388,418,900 |
| Ukrainian Web 2014 (ukTenTen14) | Ukrainian | trial | 2,194,447,594 |
| UKWaC super sensed | English | main | 315,402,632 |
| Urdu Web (UrduWaC) | Urdu | trial | 53,269,273 |
| Urdu Web 2018 (urTenTen18) | Urdu | trial | 245,656,128 |
| Vietnamese Web (VietnameseWaC) | Vietnamese | trial | 106,464,835 |
| Welsh Web 2013 (WelshWaC) | Welsh | trial | 12,458,397 |
| Welsh web corpus | Welsh | main | 50,392,441 |
| Western Frisian Web 2013 (FrisianWaC) | Frisian | trial | 3,116,119 |
| Yiddish Wikipedia corpus 2018 (yiwiki) | Yiddish | trial | 2,106,912 |
| Yoruba Web 2015 (YorubaWaC15) | Yoruba | trial | 2,816,965 |