[DEV] Russian Web 2017 (ruTenTen17) | Russian | main | 9,034,837,939 |
ACL Anthology Reference Corpus (ARC) | English | open | 62,196,334 |
Afrikaans Wikipedia corpus 2018 (afwiki) | Afrikaans | trial | 14,466,792 |
American Spanish Web 2011 (esamTenTen11) | Spanish | trial | 7,475,579,365 |
Amharic Web 2013-17 (amWaC17) | Amharic | trial | 25,975,846 |
Arabic Learner Corpus (ALC) | Arabic | main | 362,712 |
Arabic Web 2009 | Arabic | main | 150,282,522 |
Arabic Web 2012 (arTenTen12, Stanford tagger) | Arabic | trial | 7,475,624,779 |
Arabic Web 2012 sample 115M (arTenTen12, Mada tagger) | Arabic | main | 115,315,274 |
Araneum Anglicum Africanum Maius [2015] | English | main | 854,484,093 |
Araneum Anglicum Asiaticum Maius [2015] | English | main | 867,259,037 |
Araneum Anglicum Maius [2015] | English | trial | 888,466,066 |
Araneum Finnicum Maius [2014] | Finnish | main | 817,453,523 |
Araneum Francogallicum Maius [2015] | French | main | 933,688,995 |
Araneum Germanicum Maius [2013] | German | main | 875,465,845 |
Araneum Hispanicum Maius [2013] | Spanish | main | 892,299,770 |
Araneum Hungaricum Maius [2014] | Hungarian | trial | 792,549,686 |
Araneum Italicum Maius (Italian, 14.12) 1,20 G | Italian | main | 890,568,533 |
Araneum Nederlandicum Maius [2013] | Dutch | main | 713,417,518 |
Araneum Polonicum Maius [2013] | Polish | main | 595,768,667 |
Araneum Portugallicum Maius [2015] | Portuguese | main | 862,134,902 |
Araneum Russicum Russicum Maius (Russia-only Russian, 15.03) 1,20 G | Russian | trial | 859,319,823 |
Araneum Slovacum Maius [2013] | Slovak | trial | 816,125,010 |
Basque Web (BasqueWaC v2) | Basque | trial | 99,719,584 |
Belarusian Web 2016 (beTenTen16) | Belarusian | trial | 63,327,264 |
Bengali Web (bnWaC) | Bengali | trial | 11,519,730 |
BIBLE Polish-Swahili | Polish | main | 138,216 |
BIBLE Swahili-Polish | Swahili | main | 139,160 |
Boot Camp English | English | trial | 85,683,246 |
Bosnian Web (bsWaC 1.2) | Bosnian | trial | 248,478,730 |
Brazilian Portuguese corpus (Corpus Brasileiro) | Portuguese | main | 871,117,178 |
Brexit corpus (English) | English | trial | 108,452,923 |
Brexit corpus without retweets (English) | English | trial | 4,789,571 |
British Academic Spoken English Corpus (BASE) | English | open | 1,477,281 |
British Academic Written English Corpus (BAWE) | English | open | 6,968,089 |
British Law Report Corpus | English | main | 8,515,749 |
British National Corpus (BNC) | English | trial | 96,134,547 |
British National Corpus (BNC) 2014 Spoken | English | trial | 10,495,185 |
British National Corpus (BNC), tagged by CLAWS | English | trial | 96,052,598 |
British Web 2007 (ukWaC) | English | main | 1,313,058,436 |
Brown | English | open | 1,007,299 |
Brown Family | English | main | 6,963,778 |
Brown Family, CLAWS + TreeTagger tags | English | main | 6,975,474 |
Bulgarian National Corpus (BulgarianNC) | Bulgarian | main | 20,975,703 |
Bulgarian National Corpus nonweb genres | Bulgarian | main | 22,398,507 |
Bulgarian National Corpus with web | Bulgarian | main | 419,512,059 |
Bulgarian Web 2012 (bgTenTen12, TreeTagger v2) | Bulgarian | trial | 705,156,683 |
Cambridge Academic English | English | main | 3,163,648 |
Cantonese Web (CantoneseWaC) | Cantonese | trial | 30,898,663 |
Catalan Web 2014 (caTenTen14 v2) | Catalan | trial | 182,691,653 |
Cebuano Web 2018 (cebTenTen18) | Cebuano | trial | 4,552,105 |
CHILDES Afrikaans Corpus | Afrikaans | main | 26,020 |
CHILDES Catalan Corpus | Catalan | main | 209,525 |
CHILDES Croatian Corpus | Croatian | main | 300,832 |
CHILDES Danish Corpus | Danish | main | 285,231 |
CHILDES English Corpus | English | main | 22,693,506 |
CHILDES Estonian Corpus | Estonian | main | 313,457 |
CHILDES Farsi Corpus | Persian | main | 120,527 |
CHILDES French Corpus | French | main | 2,583,460 |
CHILDES Gaelic Corpus | Irish | main | 16,848 |
CHILDES German Corpus | German | main | 5,941,266 |
CHILDES Hebrew Corpus | Hebrew | main | 807,657 |
CHILDES Hungarian Corpus | Hungarian | main | 247,881 |
CHILDES Italian Corpus | Italian | main | 459,881 |
CHILDES Japanese Corpus | Japanese | main | 1,578,068 |
CHILDES Korean Corpus | Korean | main | 36,056 |
CHILDES Norwegian Corpus | Norwegian (Mixed) | main | 56,827 |
CHILDES Polish Corpus | Polish | main | 1,041,300 |
CHILDES Portuguese Corpus | Portuguese | main | 216,407 |
CHILDES Russian Corpus | Russian | main | 48,791 |
CHILDES Spanish Corpus | Spanish | main | 802,743 |
CHILDES Swedish Corpus | Swedish | main | 520,478 |
CHILDES Tamil Corpus | Tamil | main | 15,490 |
CHILDES Thai Corpus | Thai | main | 243,939 |
CHILDES Turkish Corpus | Turkish | main | 178,100 |
Chinese GigaWord 2 Corpus: Mainland, simplified | Chinese Simplified | main | 205,031,379 |
Chinese GigaWord 2 Corpus: Taiwan, traditional | Chinese Traditional | main | 382,600,557 |
Chinese Simplified Web 2017 sample | Chinese Simplified | trial | 250,361,047 |
Chinese Traditional Web (TaiwanWaC, Universal Sketch Grammar) | Chinese Traditional | main | 259,156,002 |
Chinese Traditional Web 2011 (TaiwanWaC) | Chinese Traditional | main | 259,156,002 |
Chinese Traditional Web 2017 (zhTenTen17) sample | Chinese Traditional | trial | 239,882,651 |
Chinese Web 2005 (Internet-ZH, NEUCSP tagger) | Chinese Simplified | main | 198,205,344 |
Chinese Web 2011 (zhTenTen11, sample 10M) | Chinese Simplified | main | 9,012,125 |
Chinese Web 2011 (zhTenTen11, Stanford tagger) | Chinese Simplified | trial | 1,729,867,455 |
Chinese Web 2017 (zhTenTen17) Simplified | Chinese Simplified | trial | 13,531,331,169 |
Chinese Web 2017 (zhTenTen17) Traditional | Chinese Traditional | trial | 2,400,405,372 |
COMPAS 2015 | English | access on demand | 114,967,191 |
COMPAS 2016 | English | access on demand | 260,896,404 |
CoPEP - The Corpus of Portuguese from Academic Journals (v. 1.4) | Portuguese | main | 40,423,011 |
Corpus Nko ߒߞߏ ߝߊ߬ߘߌ߬ߞߋ߬ߟߋ߲߬ߡߊ | N'Ko | open | 4,102,593 |
Corpus of Academic Journal Articles (CAJA) | English | access on demand | 79,107,410 |
Corpus of English Dialogues 1560–1760 | English | access on demand | 1,151,171 |
Corpus of Estonian Web sentences 2020 | Estonian | main | 280,961,465 |
Covid-19 | English | open | 224,061,570 |
Croatian Web (hrWaC 2.2, ReLDI) | Croatian | trial | 1,210,021,198 |
Croatian Web (hrWaC 2.2, RFTagger) | Croatian | trial | 1,211,328,660 |
csSkELL v1 (whole documents) | Czech | main | 1,717,516,129 |
csSkELL v2.2 (sentences with GDEX scores) | Czech | main | 1,443,410,941 |
Cundeelee Wangka Stories (Cundeelee Wangka) | Cundeelee Wangka | access on demand | 1,965 |
Cundeelee Wangka Stories (English) | English | access on demand | 4,423 |
Czech news and web 1995–2002 (czes2.2) | Czech | main | 366,796,757 |
Czech Web 2017 (csTenTen17) | Czech | trial | 10,502,222,474 |
Czech Web 2017 sample | Czech | trial | 249,877,322 |
CzechParl 2012 (v2 with lempos) | Czech | main | 37,184,025 |
Danish Web 2010 (DanishWaC) | Danish | main | 288,272,967 |
Danish Web 2014 (daTenTen14) | Danish | main | 2,040,976,501 |
Danish Web 2017 (daTenTen17) | Danish | trial | 2,170,690,492 |
Danish Web 2017 sample | Danish | trial | 214,447,970 |
DGT, Bulgarian | Bulgarian | main | 25,912,721 |
DGT, Croatian | Croatian | main | 3,968,608 |
DGT, Czech | Czech | main | 43,621,933 |
DGT, Danish | Danish | main | 44,962,280 |
DGT, Dutch | Dutch | main | 50,523,892 |
DGT, English | English | main | 59,106,576 |
DGT, Estonian | Estonian | main | 34,155,488 |
DGT, Finnish | Finnish | main | 35,129,923 |
DGT, French | French | main | 58,224,781 |
DGT, German | German | main | 45,380,666 |
DGT, Greek | Greek | main | 51,865,988 |
DGT, Hungarian | Hungarian | main | 2,306,272 |
DGT, Irish | Irish | main | 1,065,421 |
DGT, Italian | Italian | main | 53,260,912 |
DGT, Latvian | Latvian | main | 38,898,134 |
DGT, Lithuanian | Lithuanian | main | 38,675,242 |
DGT, Maltese | Maltese | main | 22,388,562 |
DGT, Polish | Polish | main | 44,149,107 |
DGT, Portuguese | Portuguese | main | 53,950,705 |
DGT, Romanian | Romanian | main | 26,644,734 |
DGT, Slovak | Slovak | main | 43,276,048 |
DGT, Slovenian | Slovenian | main | 42,897,385 |
DGT, Spanish | Spanish | main | 57,311,149 |
DGT, Swedish | Swedish | main | 44,378,725 |
Dutch Web 2014 (nlTenTen14) | Dutch | trial | 2,253,777,579 |
Dutch Web 2014 sample | Dutch | trial | 250,219,005 |
e-flux (International art English) | English | main | 5,036,119 |
EcoLexicon English (Environment) | English | open | 23,169,446 |
English Broadsheet Newspapers 1993–2013 (SiBol with trends) | English | main | 654,435,535 |
English Corpus for SkELL 3.10 | English | main | 1,038,200,313 |
English Corpus for SkELL 3.8 | English | main | 1,041,772,774 |
English Corpus for SkELL 3.9 | English | main | 1,041,138,575 |
English Historical Book Collection (EEBO, ECCO, Evans) | English | main | 826,296,048 |
English Preposition Corpus | English | trial | 2,136,325 |
English Web 2008 (enTenTen08) | English | main | 2,759,340,513 |
English Web 2012 (enTenTen12) | English | main | 11,191,860,036 |
English Web 2013 (enTenTen13) | English | trial | 19,685,733,337 |
English Web 2013 sample | English | trial | 204,976,089 |
English Web 2015 (enTenTen15) | English | trial | 13,190,556,334 |
English Wikipedia | English | main | 1,356,523,079 |
English Wikipedia sample with Error annotations | English | trial | 951,824 |
Estonian Corpus for Learners 2020 (etSkELL) | Estonian | main | 280,572,215 |
Estonian National Corpus 2019 (Estonian NC 2019) | Estonian | trial | 1,500,284,681 |
Estonian Reference corpus 1990-2008 (EstonianRC) | Estonian | main | 203,267,951 |
Estonian Web 2013 (etTenTen13) | Estonian | trial | 260,559,829 |
EUR-Lex Bulgarian 2/2016 | Bulgarian | trial | 329,071,554 |
EUR-Lex Croatian 2/2016 | Croatian | trial | 109,138,184 |
EUR-Lex Czech 2/2016 | Czech | trial | 350,230,088 |
EUR-Lex Danish 2/2016 | Danish | trial | 519,765,085 |
EUR-Lex Dutch 2/2016 | Dutch | trial | 583,263,688 |
EUR-Lex English 2/2016 | English | trial | 629,722,593 |
EUR-Lex Estonian 2/2016 | Estonian | trial | 291,077,511 |
EUR-Lex Finnish 2/2016 | Finnish | trial | 384,119,975 |
EUR-Lex French 2/2016 | French | trial | 677,063,993 |
EUR-Lex German 2/2016 | German | trial | 528,617,843 |
EUR-Lex Greek 2/2016 | Greek | trial | 579,344,223 |
EUR-Lex Hungarian 2/2016 | Hungarian | trial | 340,618,970 |
EUR-Lex Irish 2/2016 | Irish | trial | 31,439,542 |
EUR-Lex Italian 2/2016 | Italian | trial | 606,070,097 |
EUR-Lex judgments Bulgarian 12/2016 | Bulgarian | trial | 17,071,495 |
EUR-Lex judgments Croatian 12/2016 | Croatian | trial | 5,613,468 |
EUR-Lex judgments Czech 12/2016 | Czech | trial | 18,226,505 |
EUR-Lex judgments Danish 12/2016 | Danish | trial | 34,934,021 |
EUR-Lex judgments Dutch 12/2016 | Dutch | trial | 40,534,071 |
EUR-Lex judgments English 12/2016 | English | trial | 42,339,337 |
EUR-Lex judgments Estonian 12/2016 | Estonian | trial | 15,029,608 |
EUR-Lex judgments Finnish 12/2016 | Finnish | trial | 23,601,422 |
EUR-Lex judgments French 12/2016 | French | trial | 48,023,524 |
EUR-Lex judgments German 12/2016 | German | trial | 35,297,517 |
EUR-Lex judgments Greek 12/2016 | Greek | trial | 35,815,108 |
EUR-Lex judgments Hungarian 12/2016 | Hungarian | trial | 17,940,879 |
EUR-Lex judgments Italian 12/2016 | Italian | trial | 42,053,315 |
EUR-Lex judgments Latvian 12/2016 | Latvian | trial | 16,908,831 |
EUR-Lex judgments Lithuanian 12/2016 | Lithuanian | trial | 16,252,111 |
EUR-Lex judgments Maltese 12/2016 | Maltese | trial | 19,146,797 |
EUR-Lex judgments Polish 12/2016 | Polish | trial | 18,799,551 |
EUR-Lex judgments Portuguese 12/2016 | Portuguese | trial | 35,412,936 |
EUR-Lex judgments Romanian 12/2016 | Romanian | trial | 17,592,388 |
EUR-Lex judgments Slovak 12/2016 | Slovak | trial | 18,265,664 |
EUR-Lex judgments Slovenian 12/2016 | Slovenian | trial | 18,439,766 |
EUR-Lex judgments Spanish 12/2016 | Spanish | trial | 39,431,836 |
EUR-Lex judgments Swedish 12/2016 | Swedish | trial | 30,666,764 |
EUR-Lex Latvian 2/2016 | Latvian | trial | 324,734,544 |
EUR-Lex Lithuanian 2/2016 | Lithuanian | trial | 323,151,426 |
EUR-Lex Maltese 2/2016 | Maltese | trial | 314,396,006 |
EUR-Lex Polish 2/2016 | Polish | trial | 360,862,149 |
EUR-Lex Portuguese 2/2016 | Portuguese | trial | 595,066,701 |
EUR-Lex Romanian 2/2016 | Romanian | trial | 336,928,068 |
EUR-Lex Slovak 2/2016 | Slovak | trial | 255,531,673 |
EUR-Lex Slovenian 2/2016 | Slovenian | trial | 351,899,258 |
EUR-Lex Spanish 2/2016 | Spanish | trial | 635,187,126 |
EUR-Lex Swedish 2/2016 | Swedish | trial | 478,485,126 |
EUROPARL7, Bulgarian | Bulgarian | trial | 9,215,233 |
EUROPARL7, Czech | Czech | trial | 13,013,774 |
EUROPARL7, Danish | Danish | trial | 48,343,860 |
EUROPARL7, Dutch | Dutch | trial | 54,007,722 |
EUROPARL7, English | English | trial | 53,837,625 |
EUROPARL7, Estonian | Estonian | trial | 11,171,727 |
EUROPARL7, Finnish | Finnish | trial | 34,182,031 |
EUROPARL7, French | French | trial | 59,145,988 |
EUROPARL7, German | German | trial | 47,805,055 |
EUROPARL7, Greek | Greek | trial | 38,868,863 |
EUROPARL7, Hungarian | Hungarian | trial | 12,421,715 |
EUROPARL7, Italian | Italian | trial | 52,871,060 |
EUROPARL7, Latvian | Latvian | trial | 11,920,085 |
EUROPARL7, Lithuanian | Lithuanian | trial | 11,424,032 |
EUROPARL7, Polish | Polish | trial | 13,034,164 |
EUROPARL7, Portuguese | Portuguese | trial | 53,778,766 |
EUROPARL7, Romanian | Romanian | trial | 9,554,864 |
EUROPARL7, Slovak | Slovak | trial | 12,942,651 |
EUROPARL7, Slovenian | Slovenian | trial | 12,496,942 |
EUROPARL7, Spanish | Spanish | trial | 54,302,284 |
EUROPARL7, Swedish | Swedish | trial | 46,303,799 |
European Spanish Web 2011 (eseuTenTen11) | Spanish | trial | 2,021,633,644 |
Finnish Web 2014 (fiTenTen14) | Finnish | trial | 1,404,083,812 |
Finnish Web 2014 (fiTenTen14, TreeTagger v2) | Finnish | main | 1,404,100,049 |
Finnish Web 2014 sample (fiTenTen14, TreeTagger v2) | Finnish | trial | 40,756,118 |
Frantext (French literature of the 18th-20th century) | French | main | 15,573,070 |
Frantext (French literature of the 18th-20th century), without trends | French | main | 15,573,070 |
French corpus of 88,000 SMS (88milSMS) | French | trial | 1,206,663 |
French Web 2008 (v2 with lempos) | French | main | 104,705,211 |
French Web 2010 (frWaC) | French | main | 1,330,564,200 |
French Web 2012 (frTenTen12) | French | trial | 9,889,689,889 |
French Web 2012 sample | French | trial | 205,185,797 |
French Web 2017 (frTenTen17) | French | trial | 5,752,261,039 |
French Web 2017 sample | French | trial | 404,555,405 |
Georgian Web 2013 (kaWaC) | Georgian | trial | 50,713,604 |
German Corpus for SkELL 1.0 | German | main | 769,810,745 |
German Political Speeches Corpus | German | trial | 11,144,258 |
German Web 2010 | German | main | 2,338,036,362 |
German Web 2010 (deWaC) | German | main | 1,348,188,416 |
German Web 2013 (deTenTen13) | German | trial | 16,526,335,416 |
German Web 2013 sample | German | trial | 193,838,751 |
GerManC (German Newspapers 1650-1800) | German | main | 667,310 |
Gigafida v2.0 (referenčni) | Slovenian | main | 1,109,441,592 |
Greek Web (GkWaC with lempos) | Greek | main | 124,285,612 |
Greek Web 2014 (elTenTen14) | Greek | trial | 1,671,692,845 |
Guangwai - Lancaster Chinese Learner Corpus | Chinese Simplified | open | 1,289,060 |
Gujarati Web (guWaC) | Gujarati | trial | 17,960,095 |
Hausa Web 2015 (hausaWaC15) | Hausa (Boko) | trial | 5,304,300 |
Hebrew General Corpus (web crawled, mostly newspapers) | Hebrew | main | 157,947,728 |
Hebrew Web (HebWaC) | Hebrew | main | 47,832,254 |
Hebrew Web 2014 (heTenTen14, Meni/Alon tagged + lempos) | Hebrew | access on demand | 895,876,116 |
Hebrew Web 2014 (heTenTen14, no POS tagging) | Hebrew | trial | 890,282,843 |
Hindi Web 2012 (HindiWaC v. 4) | Hindi | trial | 107,960,109 |
Hindi Web 2013 (hiTenTen13) | Hindi | main | 351,289,441 |
Hungarian Web 2012 (huTenTen12) | Hungarian | trial | 2,572,620,694 |
Icelandic texts [sample] | Icelandic | trial | 5,436,035 |
Igbo Web 2015 (IgboWaC15) | Igbo | trial | 331,042 |
Indonesian Web (IndonesianWaC) | Indonesian | trial | 90,120,046 |
Irish Syllabic Poetry, circa 1200-1650 (BARDIC@TCD) | Irish | open | 314,807 |
Italian Corpus for SkELL 1.0 | Italian | main | 328,270,600 |
Italian Web 2006 (itWaC) | Italian | main | 1,597,295,469 |
Italian Web 2010 (itTenTen) | Italian | main | 2,588,873,046 |
Italian Web 2016 (itTenTen16) | Italian | trial | 4,989,729,171 |
Italian Web 2016 sample | Italian | trial | 201,204,942 |
itWAC (reduced) | Italian | main | 751,542,948 |
Japanese Web 2006 (jpWaC) | Japanese | main | 336,867,039 |
Japanese Web 2011 (jaTenTen11) | Japanese | trial | 8,432,256,578 |
Japanese Web 2011 (jaTenTen11, sample) | Japanese | main | 301,407,652 |
Japanese Web 2011 sample (jaTenTen11, LUW) | Japanese | trial | 163,837,671 |
Kannada Web 2012 (knWaC12) | Kannada | trial | 11,056,526 |
KAS-Dipl (diplome) | Slovenian | main | 568,188,810 |
KAS-Dr (doktorati) | Slovenian | main | 30,244,519 |
KAS-Mag (magisteriji) | Slovenian | main | 157,168,378 |
Khmer Web 2018 (kmTenTen18) | Khmer | trial | 16,500,379 |
Korean 2018 term reference corpus (koTenTen18_term_ref) | Korean | trial | 83,749,660 |
Korean Web 2012 (koTenTen12) | Korean | main | 461,196,240 |
Korean Web 2018 (koTenTen18) | Korean | trial | 1,668,851,720 |
KSUCCA (Classical Arabic) | Arabic | trial | 46,705,577 |
Lao Web 2018 (loTenTen18) | Lao | trial | 15,862,991 |
Lao Web 2019 (loTenTen19) | Lao | trial | 105,018,584 |
LatinISE corpus | Latin | trial | 11,202,216 |
Latvian Web (LatvianWaC) | Latvian | main | 57,666,024 |
Latvian Web 2014 (lvTenTen14) | Latvian | trial | 530,367,474 |
Lektor (Learner corpus of proofread and translations) | Slovenian | main | 953,038 |
LEXMCI | English | main | 1,448,180,339 |
Lithuanian Web (LithuanianWaC v2) | Lithuanian | main | 48,650,918 |
Lithuanian Web 2014 (ltTenTen14) | Lithuanian | trial | 778,151,979 |
MagyarOK teaching materials for Hungarian, levels A1 to B2 | Hungarian | open | 144,832 |
Malayalam Web (malayalamWaC) | Malayalam | trial | 15,950,663 |
Malaysian Web (MalaysianWaC) | Malay | trial | 182,578,743 |
Maldivian Wikipedia corpus 2019 (dvwiki) | Maldivian | trial | 548,211 |
Maltese MLRS Corpus | Maltese | trial | 110,714,844 |
Maori Web 2013 and 2020 (miTenTen20) | Maori | trial | 11,814,825 |
Medical Web Corpus | English | main | 33,961,786 |
Mongolian Web Texts 2016 (mnWaC16) | Mongolian | trial | 6,104,565 |
Multicultural London English Corpus | English | main | 2,391,040 |
Nepali National Corpus | Nepali | trial | 13,440,835 |
Nepali Web (NepaliWaC) | Nepali | main | 1,290,388 |
New corpus for English (NCI English) | English | main | 217,548,758 |
New Model Corpus | English | main | 95,276,958 |
Newspapers in Portuguese (CetemPúblico, CetenFolha) | Portuguese | main | 56,768,822 |
Norwegian dictionary corpus (Nynorskkorpuset) | Norwegian (Mixed) | main | 74,496,664 |
Norwegian Web 2012 | Norwegian (Mixed) | main | 669,511,569 |
Norwegian Web 2017 (noTenTen17, Bokmål) | Norwegian Bokmål | trial | 2,472,483,911 |
Norwegian Web 2017 (noTenTen17, Nynorsk) | Norwegian Nynorsk | trial | 174,830,652 |
Norwegian Web 2017 sample (Bokmål) | Norwegian Bokmål | trial | 58,955,519 |
Norwegian Web 2017 sample (Nynorsk) | Norwegian Nynorsk | trial | 58,743,828 |
OEC | English | access on demand | 2,073,319,589 |
OEC v2 | English | access on demand | 2,073,563,928 |
Open Access Journals (DOAJ - English) | English | trial | 2,662,763,697 |
Open American National Corpus (spoken) | English | main | 3,202,026 |
Open American National Corpus (written) | English | main | 11,048,137 |
Open Cambridge Learner Corpus (Uncoded) | English | access on demand | 2,975,701 |
Opus MontenegrinSubs: English | English | trial | 468,337 |
Opus MontenegrinSubs: Montenegrin | Montenegrin | trial | 365,698 |
OPUS2 Afrikaans | Afrikaans | main | 586,334 |
OPUS2 Albanian | Albanian | trial | 46,304,346 |
OPUS2 Arabic | Arabic | main | 300,000,057 |
OPUS2 Bosnian | Bosnian | main | 43,582,516 |
OPUS2 Brazilian Portuguese | Portuguese | main | 272,300,927 |
OPUS2 Bulgarian | Bulgarian | main | 183,115,244 |
OPUS2 Chinese Simplified | Chinese Simplified | main | 243,427,123 |
OPUS2 Chinese Traditional | Chinese Traditional | main | 380,245 |
OPUS2 Croatian | Croatian | main | 121,369,625 |
OPUS2 Czech | Czech | main | 203,845,619 |
OPUS2 Danish | Danish | main | 120,107,271 |
OPUS2 Dutch | Dutch | main | 356,363,571 |
OPUS2 English | English | main | 1,139,515,048 |
OPUS2 Estonian | Estonian | main | 64,879,741 |
OPUS2 Finnish | Finnish | main | 131,985,872 |
OPUS2 French | French | main | 766,833,908 |
OPUS2 German | German | main | 125,229,773 |
OPUS2 Greek | Greek | main | 239,360,926 |
OPUS2 Hebrew | Hebrew | main | 130,972,343 |
OPUS2 Hindi | Hindi | main | 854,741 |
OPUS2 Hungarian | Hungarian | main | 157,495,018 |
OPUS2 Italian | Italian | main | 180,532,849 |
OPUS2 Japanese | Japanese | main | 5,455,106 |
OPUS2 Korean | Korean | main | 374,850 |
OPUS2 Latvian | Latvian | main | 24,499,516 |
OPUS2 Lithuanian | Lithuanian | main | 29,621,940 |
OPUS2 Macedonian | Macedonian | trial | 40,348,792 |
OPUS2 Norwegian | Norwegian (Mixed) | main | 20,237,510 |
OPUS2 Persian | Persian | trial | 4,425,133 |
OPUS2 Polish | Polish | main | 208,008,636 |
OPUS2 Portuguese | Portuguese | main | 297,700,205 |
OPUS2 Romanian | Romanian | main | 282,408,295 |
OPUS2 Russian | Russian | main | 307,709,872 |
OPUS2 Serbian | Serbian | main | 153,237,786 |
OPUS2 Slovak | Slovak | main | 62,451,407 |
OPUS2 Slovenian | Slovenian | main | 121,228,966 |
OPUS2 Spanish | Spanish | main | 111,497 |
OPUS2 Swedish | Swedish | main | 102,298,686 |
OPUS2 Turkish | Turkish | main | 151,342,424 |
OPUS2 Ukrainian | Ukrainian | main | 2,578,289 |
Oromo Web 2016 (orWaC16) | Oromo | trial | 4,249,953 |
Oxford Children's Corpus 2015 (PTag) | English | access on demand | 210,322,185 |
Oxford Children's Corpus 2015 -- Education (PTag) | English | access on demand | 1,323,174 |
Oxford Children's Corpus 2015 -- Reading (PTag) | English | access on demand | 34,284,687 |
Oxford Children's Corpus 2015 -- Writing (PTag) | English | access on demand | 174,714,324 |
Oxford Children's Corpus 2016 (PTag) | English | access on demand | 284,360,063 |
Oxford Children's Corpus 2016 -- Reading (PTag) | English | access on demand | 53,858,955 |
Oxford Children's Corpus 2016 -- Writing (PTag) | English | access on demand | 229,177,934 |
Oxford Corpus of Academic English (April 2012) | English | access on demand | 71,372,972 |
Paisa | Italian | main | 221,989,288 |
Parsed German Web (sDeWaC) | German | main | 755,165,551 |
Penn Corpora of Historical English | English | access on demand | 3,800,639 |
PICAE 2010 | English | access on demand | 31,025,920 |
Polish Web (PolishWac, Morfeusz and TaKIPI tagger) | Polish | main | 103,028,410 |
Polish Web 2012 (plTenTen12, RFTagger) | Polish | trial | 7,715,835,214 |
Polish Web 2012 sample | Polish | trial | 191,648,244 |
Portuguese Web 2011 (ptTenTen11) | Portuguese | trial | 3,896,392,719 |
Portuguese Web 2011 (ptTenTen11, Palavras parsed) | Portuguese | main | 2,757,635,105 |
Portuguese Web 2011 sample | Portuguese | trial | 202,548,549 |
Project Gutenberg English | English | main | 443,471,071 |
pukWaC (ukWaC parsed with MaltParser) | English | main | 39,502,648 |
Quran annotated corpus [unvowelled Arabic] | Arabic | main | 128,243 |
Quran annotated corpus [unvowelled Latin] | Arabic | main | 99,268 |
Quran annotated corpus [vowelled Arabic] | Arabic | main | 128,241 |
Quran annotated corpus [vowelled Latin] | Arabic | main | 97,970 |
RapCor1288 - Francophone rap songs | French | trial | 709,057 |
Riznica v0.1 | Croatian | main | 85,273,724 |
Romanian Web 2016 (roTenTen16) | Romanian | trial | 2,640,496,763 |
ruSkELL 1.6 | Russian | main | 975,584,449 |
Russian Web 2006 (v2 with lempos) | Russian | main | 147,930,261 |
Russian Web 2011 (ruTenTen11) | Russian | trial | 14,553,856,113 |
Russian Web 2011 sample (ruTenTen11) | Russian | trial | 998,099,963 |
Samoan Web (SamoanWac1) | Samoan | trial | 3,115,385 |
ScienceBlogs | English | main | 103,175,233 |
Scottish Gaelic Wiki 2015 (gdWiki) | Scottish Gaelic | trial | 980,026 |
Semcor v3.0 (sense-tagged corpus) | English | main | 664,038 |
Serbian Web (srWaC 1.2 processed by Hunpos) | Serbian | trial | 477,724,164 |
Serbian Web (srWaC 1.2 processed by RFTagger v1) | Serbian (Latin) | trial | 441,888,202 |
Serbian Web (srWaC 1.2) | Serbian (Latin) | trial | 476,888,297 |
Setswana/Tswana Web (SetswanaWaC v2) | Setswana | trial | 11,496,687 |
Slovak Web 2011 (skTenTen11) | Slovak | trial | 540,112,634 |
Slovak Web 2011 (skTenTen11, ambiguity tag, lempos) | Slovak | main | 715,707,053 |
Slovak Web 2011 sample | Slovak | trial | 189,609,195 |
Slovenian reference corpus (FidaPLUS v2) | Slovenian | trial | 600,309,670 |
Slovenian Web (slWaC 2.1 processed with TreeTagger v2) | Slovenian | trial | 755,255,547 |
Slovenian Web (slWaC 2.1) | Slovenian | trial | 754,255,589 |
Slovenian Web 2015 (slTenTen15, TreeTagger v2) | Slovenian | trial | 829,544,337 |
Slovenian Web 2015 sample | Slovenian | trial | 195,792,821 |
Somali Web 2016 (soWaC16) | Somali | trial | 71,871,585 |
SoNaR | Dutch | access on demand | 425,978,755 |
Spanish Web 2005 (SpanishWaC) | Spanish | main | 97,773,185 |
Spanish Web 2011 (esTenTen11, Eu + Am) | Spanish | trial | 9,497,213,009 |
Spanish Web 2011 sample | Spanish | trial | 212,142,794 |
Spanish Web 2018 (esTenTen18) | Spanish | trial | 17,553,075,259 |
Spanish Web 2018 sample | Spanish | trial | 177,257,648 |
Susanne | English | trial | 128,998 |
Swahili Web 2014 (SwahiliWaC) | Swahili | trial | 17,882,483 |
Swedish Web 2014 (svTenTen14) | Swedish | trial | 3,401,035,817 |
Swedish Web 2014 sample | Swedish | trial | 45,477,881 |
SwedishParole | Swedish | main | 21,735,113 |
Tagalog (Filipino) Web 2019 (tlTenTen19) | Tagalog | trial | 197,908,842 |
Tajik Web (TajikWaC) | Tajik | trial | 93,151,897 |
TalkBank Persian (blog posts) | Persian | main | 474,773,547 |
Tamil Web 2015 (TamilWaC) | Tamil | trial | 26,750,515 |
Tatar Mixed Corpus | Tatar | trial | 102,779,803 |
Tatar News (2000-2014), version with lempos | Tatar | main | 24,927,439 |
Tatar Web 2015 sample | Tatar | trial | 195,901 |
Ted Talks transcripts | English | main | 2,882,085 |
Telugu Web 2017 (teTenTen) | Telugu | trial | 126,807,158 |
Thai Web (ThaiWaC) | Thai | trial | 82,787,119 |
Thai Web 2018 (thTenTen18) | Thai | trial | 640,530,227 |
The Annotated Corpus of Classical Tibetan (ACTib 2.0) | Tibetan | trial | 170,202,078 |
The New Corpus for Ireland | Irish | main | 29,886,201 |
Tigrinya Web 2016 (tiWaC16) | Tigrinya | trial | 2,087,613 |
Timestamped JSI web corpus 2014-2016 Arabic | Arabic | trial | 976,573,611 |
Timestamped JSI web corpus 2014-2016 Catalan | Catalan | trial | 99,395,494 |
Timestamped JSI web corpus 2014-2016 Czech | Czech | trial | 289,488,005 |
Timestamped JSI web corpus 2014-2016 Dutch | Dutch | trial | 401,347,934 |
Timestamped JSI web corpus 2014-2016 English | English | trial | 18,315,071,361 |
Timestamped JSI web corpus 2014-2016 Finnish | Finnish | trial | 119,109,490 |
Timestamped JSI web corpus 2014-2016 French | French | trial | 1,870,341,756 |
Timestamped JSI web corpus 2014-2016 German | German | trial | 1,987,759,563 |
Timestamped JSI web corpus 2014-2016 Hebrew | Hebrew | trial | 111,339,363 |
Timestamped JSI web corpus 2014-2016 Hungarian | Hungarian | trial | 180,843,359 |
Timestamped JSI web corpus 2014-2016 Italian | Italian | trial | 1,375,907,374 |
Timestamped JSI web corpus 2014-2016 Korean | Korean | trial | 438,816,127 |
Timestamped JSI web corpus 2014-2016 Polish | Polish | trial | 157,930,228 |
Timestamped JSI web corpus 2014-2016 Portuguese | Portuguese | trial | 1,109,771,393 |
Timestamped JSI web corpus 2014-2016 Russian | Russian | trial | 1,120,731,416 |
Timestamped JSI web corpus 2014-2016 Serbian | Serbian | trial | 86,380,673 |
Timestamped JSI web corpus 2014-2016 Spanish | Spanish | trial | 4,055,944,612 |
Timestamped JSI web corpus 2014-2016 Swedish | Swedish | trial | 335,782,681 |
Timestamped JSI web corpus 2014-2020 Arabic | Arabic | main | 4,121,147,715 |
Timestamped JSI web corpus 2014-2020 Catalan | Catalan | main | 373,235,642 |
Timestamped JSI web corpus 2014-2020 Czech | Czech | main | 901,794,639 |
Timestamped JSI web corpus 2014-2020 Dutch | Dutch | main | 1,181,836,141 |
Timestamped JSI web corpus 2014-2020 English | English | main | 53,106,755,084 |
Timestamped JSI web corpus 2014-2020 Finnish | Finnish | main | 369,454,982 |
Timestamped JSI web corpus 2014-2020 French | French | main | 5,982,741,890 |
Timestamped JSI web corpus 2014-2020 German | German | main | 6,194,176,109 |
Timestamped JSI web corpus 2014-2020 Hebrew | Hebrew | main | 406,351,360 |
Timestamped JSI web corpus 2014-2020 Hungarian | Hungarian | main | 714,951,341 |
Timestamped JSI web corpus 2014-2020 Italian | Italian | main | 6,509,458,717 |
Timestamped JSI web corpus 2014-2020 Korean | Korean | main | 1,438,494,218 |
Timestamped JSI web corpus 2014-2020 Polish | Polish | main | 729,292,544 |
Timestamped JSI web corpus 2014-2020 Portuguese | Portuguese | main | 3,957,241,843 |
Timestamped JSI web corpus 2014-2020 Russian | Russian | main | 4,791,961,483 |
Timestamped JSI web corpus 2014-2020 Serbian | Serbian | main | 466,051,344 |
Timestamped JSI web corpus 2014-2020 Spanish | Spanish | main | 13,834,261,153 |
Timestamped JSI web corpus 2014-2020 Swedish | Swedish | main | 1,007,079,426 |
Timestamped JSI web corpus 2020-09 Arabic | Arabic | main | 93,839,059 |
Timestamped JSI web corpus 2020-09 Catalan | Catalan | main | 9,114,479 |
Timestamped JSI web corpus 2020-09 Czech | Czech | main | 16,500,590 |
Timestamped JSI web corpus 2020-09 Dutch | Dutch | main | 27,350,237 |
Timestamped JSI web corpus 2020-09 English | English | main | 944,265,733 |
Timestamped JSI web corpus 2020-09 Finnish | Finnish | main | 7,165,935 |
Timestamped JSI web corpus 2020-09 French | French | main | 133,128,037 |
Timestamped JSI web corpus 2020-09 German | German | main | 119,113,152 |
Timestamped JSI web corpus 2020-09 Hebrew | Hebrew | main | 7,962,757 |
Timestamped JSI web corpus 2020-09 Hungarian | Hungarian | main | 21,325,758 |
Timestamped JSI web corpus 2020-09 Italian | Italian | main | 251,646,734 |
Timestamped JSI web corpus 2020-09 Korean | Korean | main | 19,413,863 |
Timestamped JSI web corpus 2020-09 Polish | Polish | main | 29,946,442 |
Timestamped JSI web corpus 2020-09 Portuguese | Portuguese | main | 96,906,119 |
Timestamped JSI web corpus 2020-09 Russian | Russian | main | 133,493,258 |
Timestamped JSI web corpus 2020-09 Serbian | Serbian | main | 12,175,985 |
Timestamped JSI web corpus 2020-09 Spanish | Spanish | main | 325,029,575 |
Timestamped JSI web corpus 2020-09 Swedish | Swedish | main | 20,294,470 |
Timestamped JSI web corpus 2020-10 Arabic | Arabic | main | 96,538,837 |
Timestamped JSI web corpus 2020-10 Catalan | Catalan | main | 9,685,481 |
Timestamped JSI web corpus 2020-10 Czech | Czech | main | 17,378,113 |
Timestamped JSI web corpus 2020-10 Dutch | Dutch | main | 30,202,034 |
Timestamped JSI web corpus 2020-10 English | English | main | 986,590,708 |
Timestamped JSI web corpus 2020-10 Finnish | Finnish | main | 7,660,361 |
Timestamped JSI web corpus 2020-10 French | French | main | 138,015,892 |
Timestamped JSI web corpus 2020-10 German | German | main | 127,987,516 |
Timestamped JSI web corpus 2020-10 Hebrew | Hebrew | main | 8,401,215 |
Timestamped JSI web corpus 2020-10 Hungarian | Hungarian | main | 22,408,596 |
Timestamped JSI web corpus 2020-10 Italian | Italian | main | 259,816,566 |
Timestamped JSI web corpus 2020-10 Korean | Korean | main | 19,346,769 |
Timestamped JSI web corpus 2020-10 Polish | Polish | main | 32,034,885 |
Timestamped JSI web corpus 2020-10 Portuguese | Portuguese | main | 101,374,205 |
Timestamped JSI web corpus 2020-10 Russian | Russian | main | 138,972,026 |
Timestamped JSI web corpus 2020-10 Serbian | Serbian | main | 13,713,045 |
Timestamped JSI web corpus 2020-10 Spanish | Spanish | main | 340,052,637 |
Timestamped JSI web corpus 2020-10 Swedish | Swedish | main | 21,327,238 |
Turkic web – Azerbaijani | Azerbaijani | trial | 94,267,206 |
Turkic web – Kazakh | Kazakh | trial | 139,417,763 |
Turkic web – Kyrgyz | Kyrgyz | trial | 19,369,507 |
Turkic web – Turkmen | Turkmen | trial | 2,105,359 |
Turkic web – Uzbek | Uzbek | trial | 18,720,334 |
Turkish Web (trWaC) | Turkish | main | 32,791,491 |
Turkish Web 2012 (trTenTen12) | Turkish | trial | 3,388,418,900 |
Ukrainian Web 2014 (ukTenTen14) | Ukrainian | trial | 2,194,447,594 |
UKWaC super sensed | English | main | 315,402,632 |
Urdu Web (UrduWaC) | Urdu | trial | 53,269,273 |
Urdu Web 2018 (urTenTen18) | Urdu | trial | 245,656,128 |
Vietnamese Web (VietnameseWaC) | Vietnamese | trial | 106,464,835 |
Welsh Web 2013 (WelshWaC) | Welsh | trial | 12,458,397 |
Welsh web corpus | Welsh | main | 50,392,441 |
Western Frisian Web 2013 (FrisianWaC) | Frisian | trial | 3,116,119 |
Yiddish Wikipedia corpus 2018 (yiwiki) | Yiddish | trial | 2,106,912 |
Yoruba Web 2015 (YorubaWaC15) | Yoruba | trial | 2,816,965 |