CHILDES Afrikaans Corpus |
Afrikaans |
main |
33,134 |
OPUS2 Afrikaans |
Afrikaans |
trial |
743,954 |
OPUS2 Albanian |
Albanian |
trial |
55,099,328 |
Amharic WaC [2013 + 2015 + 2016] |
Amharic |
trial |
20,287,250 |
Arabic Web |
Arabic |
main |
174,239,600 |
KSUCCA (Classical Arabic) |
Arabic |
main |
59,693,146 |
Arabic Learner Corpus (ALC) |
Arabic |
main |
386,583 |
Arabic Web 2012 sample 115M (arTenTen12, Mada tagger) |
Arabic |
main |
131,159,731 |
OPUS2 Arabic |
Arabic |
main |
406,527,277 |
Quran annotated corpus [unvowelled Arabic] |
Arabic |
main |
128,243 |
Quran annotated corpus [unvowelled Latin] |
Arabic |
main |
128,243 |
Quran annotated corpus [vowelled Arabic] |
Arabic |
main |
128,243 |
Quran annotated corpus [vowelled Latin] |
Arabic |
main |
128,243 |
Timestamped JSI web corpus 2014-2016 Arabic |
Arabic |
trial |
1,084,155,423 |
Arabic Web 2012 (arTenTen12, Stanford tagger) |
Arabic |
trial |
8,322,097,229 |
Turkic web – Azerbaijani |
Azerbaijani |
trial |
115,280,755 |
Basque Web (BasqueWaC) |
Basque |
trial |
123,856,183 |
Bengali Web (BengaliWaC) |
Bengali |
trial |
13,752,575 |
OPUS2 Bosnian |
Bosnian |
main |
55,224,138 |
Bosnian Web 2014 (BosnianWaC14) |
Bosnian |
trial |
290,176,507 |
Bulgarian National Corpus (BulgarianNC) |
Bulgarian |
main |
26,518,884 |
Bulgarian National Corpus nonweb genres |
Bulgarian |
main |
27,721,533 |
Bulgarian National Corpus with web |
Bulgarian |
main |
545,637,740 |
DGT, Bulgarian |
Bulgarian |
main |
32,778,982 |
EUR-Lex judgments Bulgarian 12/2016 |
Bulgarian |
main |
21,537,635 |
OPUS2 Bulgarian |
Bulgarian |
main |
238,945,836 |
Bulgarian Web 2012 (bgTenTen12) |
Bulgarian |
trial |
843,328,184 |
EUR-Lex Bulgarian 2/2016 |
Bulgarian |
trial |
457,463,831 |
EUROPARL7, Bulgarian |
Bulgarian |
trial |
10,602,635 |
CHILDES Catalan Corpus |
Catalan |
main |
277,816 |
Catalan Web 2014 (caTenTen14) |
Catalan |
trial |
4,777,786,899 |
Timestamped JSI web corpus 2014-2016 Catalan |
Catalan |
trial |
114,317,450 |
Chinese GigaWord 2 Corpus: Mainland, simplified |
Chinese Simplified |
main |
250,124,230 |
Chinese Web (Internet-ZH) |
Chinese Simplified |
main |
277,931,664 |
OPUS2 Chinese Simplified |
Chinese Simplified |
main |
299,338,099 |
Chinese Web 2011 (zhTenTen11, sample 10M) |
Chinese Simplified |
main |
11,028,308 |
Guangwai - Lancaster Chinese Learner Corpus |
Chinese Simplified |
open |
1,664,237 |
Chinese Web 2011 (zhTenTen11) |
Chinese Simplified |
trial |
2,106,661,021 |
Chinese GigaWord 2 Corpus: Taiwan, traditional |
Chinese Traditional |
main |
455,526,209 |
Chinese Traditional Web (TaiwanWaC) |
Chinese Traditional |
main |
349,198,060 |
Chinese Traditional Web (TaiwanWaC, Universal Sketch Grammar) |
Chinese Traditional |
main |
349,198,060 |
OPUS2 Chinese Traditional |
Chinese Traditional |
main |
622,382 |
CHILDES Croatian Corpus |
Croatian |
main |
389,674 |
DGT, Croatian |
Croatian |
main |
5,123,494 |
EUR-Lex judgments Croatian 12/2016 |
Croatian |
main |
7,416,811 |
OPUS2 Croatian |
Croatian |
main |
156,942,211 |
Croatian Web 2011 & 2013 (hrWaC 2.2) |
Croatian |
trial |
1,397,757,548 |
EUR-Lex Croatian 2/2016 |
Croatian |
trial |
156,309,317 |
csSkELL v1 (whole documents) |
Czech |
main |
2,072,446,673 |
csSkELL v2.1 (only sentences with GDEX scores) |
Czech |
main |
1,863,757,837 |
csSkELL v2.2 (sentences with GDEX scores) |
Czech |
main |
1,726,564,383 |
csSkELL v2 (only sentences with GDEX scores) |
Czech |
main |
1,946,264,694 |
CzechParl 2012 |
Czech |
main |
51,366,108 |
Czech news and web 1995–2002 (czes2) |
Czech |
main |
458,225,771 |
Czech Web 2012 (czTenTen12 v8, sample) |
Czech |
main |
64,607,138 |
DGT, Czech |
Czech |
main |
57,094,285 |
EUR-Lex judgments Czech 12/2016 |
Czech |
main |
23,906,139 |
OPUS2 Czech |
Czech |
main |
275,519,334 |
Timestamped JSI web corpus 2014-2016 Czech |
Czech |
trial |
344,176,348 |
Czech Web 2012 (czTenTen12 v9) |
Czech |
trial |
5,069,447,935 |
EUR-Lex Czech 2/2016 |
Czech |
trial |
501,361,784 |
EUROPARL7, Czech |
Czech |
trial |
15,290,586 |
CHILDES Danish Corpus |
Danish |
main |
372,811 |
Danish Web (DanishWaC) |
Danish |
main |
353,703,002 |
DGT, Danish |
Danish |
main |
58,810,703 |
EUR-Lex judgments Danish 12/2016 |
Danish |
main |
45,307,188 |
OPUS2 Danish |
Danish |
main |
153,261,335 |
Danish Web 2014 (daTenTen14) |
Danish |
trial |
2,395,139,491 |
EUR-Lex Danish 2/2016 |
Danish |
trial |
731,423,452 |
EUROPARL7, Danish |
Danish |
trial |
55,794,038 |
CHILDES Dutch Corpus |
Dutch |
main |
7,592,039 |
DGT, Dutch |
Dutch |
main |
62,654,517 |
EUR-Lex judgments Dutch 12/2016 |
Dutch |
main |
49,746,950 |
Araneum Nederlandicum Maius [2013] |
Dutch |
main |
1,200,000,837 |
OPUS2 Dutch |
Dutch |
main |
446,240,037 |
EUR-Lex Dutch 2/2016 |
Dutch |
trial |
783,154,917 |
EUROPARL7, Dutch |
Dutch |
trial |
59,756,704 |
Timestamped JSI web corpus 2014-2016 Dutch |
Dutch |
trial |
463,471,686 |
Dutch Web 2014 (nlTenTen14) |
Dutch |
trial |
3,013,056,738 |
British Law Report Corpus |
English |
main |
10,036,051 |
Brown Family, CLAWS + TreeTagger tags |
English |
main |
8,073,482 |
Brown Family |
English |
main |
8,099,732 |
CHILDES English Corpus |
English |
main |
29,480,736 |
Cambridge Academic English |
English |
main |
3,738,308 |
DGT, English |
English |
main |
74,365,007 |
English Historical Book Collection (EEBO, ECCO, Evans) |
English |
main |
987,242,247 |
e-flux (International art English) |
English |
main |
6,238,592 |
Araneum Anglicum Africanum Maius [2015] |
English |
main |
1,200,000,194 |
Araneum Anglicum Asiaticum Maius [2015] |
English |
main |
1,200,000,489 |
English Preposition Corpus |
English |
main |
2,430,218 |
English Web 2012 (enTenTen12, sample 40M) |
English |
main |
40,920,950 |
English Web 2008 (enTenTen08) |
English |
main |
3,268,798,627 |
English Wikipedia |
English |
main |
1,632,582,504 |
Project Gutenberg English |
English |
main |
529,531,582 |
EUR-Lex judgments English 12/2016 |
English |
main |
51,499,120 |
London English Corpus |
English |
main |
2,959,320 |
LEXMCI |
English |
main |
1,720,056,987 |
New Model Corpus |
English |
main |
114,627,650 |
New corpus for English (NCI English) |
English |
main |
257,900,777 |
Open American National Corpus (spoken) |
English |
main |
3,369,613 |
Open American National Corpus (written) |
English |
main |
13,572,382 |
OPUS2 English |
English |
main |
1,441,844,046 |
pukWaC (ukWaC parsed with MaltParser) |
English |
main |
46,256,586 |
ScienceBlogs |
English |
main |
122,942,494 |
SiBol/Port (English broadsheet newspapers) |
English |
main |
387,585,716 |
English Corpus for SkELL 3.6 |
English |
main |
1,237,286,904 |
TED_en (transcripts of TED talks) |
English |
main |
3,421,262 |
ukWaC (British Web corpus) |
English |
main |
1,559,716,979 |
UKWaC super sensed |
English |
main |
370,023,634 |
ACL Anthology Reference Corpus (ARC) |
English |
open |
49,348,397 |
British Academic Spoken English Corpus (BASE) |
English |
open |
1,252,256 |
British Academic Written English Corpus (BAWE) |
English |
open |
8,336,262 |
Brown |
English |
open |
1,175,675 |
EcoLexicon English corpus |
English |
open |
28,616,037 |
British National Corpus (BNC), tagged by CLAWS |
English |
trial |
112,181,015 |
British National Corpus (BNC) |
English |
trial |
112,345,722 |
Directory of Open Access Journals (English) |
English |
trial |
3,349,931,737 |
Araneum Anglicum Maius [2015] |
English |
trial |
1,200,023,361 |
Timestamped JSI web corpus 2014-2016 English |
English |
trial |
21,336,894,049 |
Timestamped web corpus combined 2005-2015 (Newsfeed+Feed) |
English |
trial |
9,928,357,596 |
English Web 2013 (enTenTen13) |
English |
trial |
22,728,686,012 |
EUR-Lex English 2/2016 |
English |
trial |
845,040,420 |
EUROPARL7, English |
English |
trial |
60,741,877 |
Timestamped web corpus 2005-2014 (Feed) |
English |
trial |
640,820,898 |
Susanne |
English |
trial |
150,426 |
CHILDES Estonian Corpus |
Estonian |
main |
399,547 |
DGT, Estonian |
Estonian |
main |
46,445,829 |
Estonian Reference corpus with Web (EstonianNC) |
Estonian |
main |
563,220,548 |
Estonian Reference corpus (EstonianRC) |
Estonian |
main |
249,923,332 |
Estonian Web 2013 (etTenTen13) [New Word Sketches] |
Estonian |
main |
330,045,196 |
EUR-Lex judgments Estonian 12/2016 |
Estonian |
main |
20,279,247 |
OPUS2 Estonian |
Estonian |
main |
88,432,596 |
Estonian Web 2013 (etTenTen13) |
Estonian |
trial |
330,045,196 |
EUR-Lex Estonian 2/2016 |
Estonian |
trial |
437,435,453 |
EUROPARL7, Estonian |
Estonian |
trial |
13,162,640 |
Filipino Web (FilipinoWaC) |
Filipino |
trial |
31,845,404 |
Philippine Web (philippineWaC16) |
Filipino |
trial |
40,302,836 |
DGT, Finnish |
Finnish |
main |
47,397,459 |
Araneum Finnicum Maius [2014] |
Finnish |
main |
1,200,000,486 |
EUR-Lex judgments Finnish 12/2016 |
Finnish |
main |
30,993,755 |
OPUS2 Finnish |
Finnish |
main |
180,134,681 |
EUR-Lex Finnish 2/2016 |
Finnish |
trial |
558,884,960 |
EUROPARL7, Finnish |
Finnish |
trial |
40,979,520 |
Timestamped JSI web corpus 2014-2016 Finnish |
Finnish |
trial |
143,709,979 |
Finnish Web 2014 (fiTenTen14, TreeTagger v2) |
Finnish |
trial |
1,703,429,270 |
CHILDES French Corpus |
French |
main |
3,287,017 |
DGT, French |
French |
main |
70,602,745 |
Frantext (French literature of the 18th-20th century) |
French |
main |
26,265,698 |
Araneum Francogallicum Maius [2015] |
French |
main |
1,200,004,721 |
French Web 2012 sample (frTenTen12) |
French |
main |
39,472,639 |
EUR-Lex judgments French 12/2016 |
French |
main |
58,993,172 |
OPUS2 French |
French |
main |
956,614,852 |
French web corpus |
French |
main |
126,850,281 |
EUR-Lex French 2/2016 |
French |
trial |
920,640,086 |
EUROPARL7, French |
French |
trial |
66,661,141 |
Timestamped JSI web corpus 2014-2016 French |
French |
trial |
2,188,593,260 |
French Web 2012 (frTenTen12) |
French |
trial |
11,444,973,582 |
Western Frisian Web 2013 (FrisianWaC) |
Frisian |
trial |
3,738,968 |
Georgian Web (georgianWaC) |
Georgian |
trial |
63,632,861 |
Araneum Germanicum Maius [2013] |
German |
main |
1,200,000,146 |
German Web 2013 sample (deTenTen13) |
German |
main |
65,804,983 |
German Web (deWaC) |
German |
main |
1,627,169,557 |
DGT, German |
German |
main |
58,319,542 |
GerManC (German Newspapers 1650-1800) |
German |
main |
800,783 |
EUR-Lex judgments German 12/2016 |
German |
main |
44,891,478 |
OPUS2 German |
German |
main |
157,849,124 |
Parsed German Web (sDeWaC) |
German |
main |
886,661,231 |
German Web 2013 (deTenTen13) |
German |
trial |
19,808,173,163 |
Timestamped JSI web corpus 2014-2016 German |
German |
trial |
2,378,228,966 |
EUR-Lex German 2/2016 |
German |
trial |
718,370,201 |
EUROPARL7, German |
German |
trial |
55,251,638 |
DGT, Greek |
Greek |
main |
64,538,668 |
Greek Web (GkWaC) |
Greek |
main |
149,067,023 |
EUR-Lex judgments Greek 12/2016 |
Greek |
main |
44,825,698 |
OPUS2 Greek |
Greek |
main |
305,404,357 |
Greek Web 2014 (elTenTen14) |
Greek |
trial |
1,958,348,129 |
EUR-Lex Greek 2/2016 |
Greek |
trial |
775,079,501 |
EUROPARL7, Greek |
Greek |
trial |
44,097,921 |
Gujarati Web (GujarathiWaC) |
Gujarati |
trial |
22,201,247 |
CHILDES Hebrew Corpus |
Hebrew |
main |
1,034,238 |
Hebrew General Corpus (web crawled, mostly newspapers) |
Hebrew |
main |
192,119,449 |
Hebrew Web (HebWaC) |
Hebrew |
main |
60,351,738 |
OPUS2 Hebrew |
Hebrew |
main |
252,278,074 |
Timestamped JSI web corpus 2014-2016 Hebrew |
Hebrew |
trial |
134,830,039 |
Hebrew Web 2014 (heTenTen14) |
Hebrew |
trial |
1,061,788,271 |
Hindi Web (HindiWaC v. 3) |
Hindi |
main |
65,772,188 |
Hindi Web (HindiWaC v. 4) |
Hindi |
main |
120,600,574 |
OPUS2 Hindi |
Hindi |
main |
1,642,973 |
Hindi Web 2013 (hiTenTen13) |
Hindi |
trial |
405,366,140 |
CHILDES Hungarian Corpus |
Hungarian |
main |
311,543 |
DGT, Hungarian |
Hungarian |
main |
55,276,730 |
Hungarian Web 2012 (huTenTen12) |
Hungarian |
main |
3,184,161,466 |
EUR-Lex judgments Hungarian 12/2016 |
Hungarian |
main |
24,542,189 |
OPUS2 Hungarian |
Hungarian |
main |
218,409,426 |
EUR-Lex Hungarian 2/2016 |
Hungarian |
trial |
499,799,589 |
EUROPARL7, Hungarian |
Hungarian |
trial |
14,655,015 |
Araneum Hungaricum Maius [2014] |
Hungarian |
trial |
1,200,001,609 |
Timestamped JSI web corpus 2014-2016 Hungarian |
Hungarian |
trial |
218,405,214 |
Icelandic texts [sample] |
Icelandic |
trial |
9,968,822 |
Igbo Web 2015 (IgboWaC15) |
Igbo |
trial |
396,276 |
Indonesian Web (IndonesianWaC) |
Indonesian |
trial |
109,281,359 |
CHILDES Gaelic Corpus |
Irish |
main |
20,823 |
DGT, Irish |
Irish |
main |
1,251,732 |
EUR-Lex Irish 2/2016 |
Irish |
trial |
37,467,080 |
New Corpus for Ireland (NCI Irish) |
Irish |
trial |
34,358,267 |
CHILDES Italian Corpus |
Italian |
main |
572,217 |
DGT, Italian |
Italian |
main |
65,936,285 |
Araneum Italicum Maius (Italian, 14.12) 1,20 G |
Italian |
main |
1,200,000,174 |
Italian Web 2010 sample (itTenTen) |
Italian |
main |
48,904,255 |
Italian web corpus (itWaC) |
Italian |
main |
1,909,535,703 |
EUR-Lex judgments Italian 12/2016 |
Italian |
main |
52,943,414 |
OPUS2 Italian |
Italian |
main |
231,143,960 |
EUR-Lex Italian 2/2016 |
Italian |
trial |
829,319,312 |
EUROPARL7, Italian |
Italian |
trial |
59,177,399 |
Timestamped JSI web corpus 2014-2016 Italian |
Italian |
trial |
1,573,777,557 |
Italian Web 2010 (itTenTen) |
Italian |
trial |
3,076,908,415 |
CHILDES Japanese Corpus |
Japanese |
main |
2,187,308 |
Japanese Web (JpWaC) |
Japanese |
main |
413,310,996 |
OPUS2 Japanese |
Japanese |
main |
6,596,733 |
Japanese Web 2011 (jpTenTen11) |
Japanese |
trial |
10,321,875,664 |
Japanese Web 2011 sample (jpTenTen11, LUW) |
Japanese |
trial |
203,674,569 |
Turkic web – Kazakh |
Kazakh |
trial |
175,445,327 |
CHILDES Korean Corpus |
Korean |
main |
53,339 |
OPUS2 Korean |
Korean |
main |
500,152 |
Timestamped JSI web corpus 2014-2016 Korean |
Korean |
trial |
547,918,466 |
Korean Web 2012 (koTenTen12) |
Korean |
trial |
560,945,022 |
Turkic web – Kyrgyz |
Kyrgyz |
trial |
24,084,100 |
LatinISE historical corpus v2 |
Latin |
trial |
12,995,824 |
DGT, Latvian |
Latvian |
main |
54,287,472 |
EUR-Lex judgments Latvian 12/2016 |
Latvian |
main |
21,977,367 |
Latvian Web (LatvianWaC) |
Latvian |
main |
74,447,302 |
OPUS2 Latvian |
Latvian |
main |
34,012,690 |
EUR-Lex Latvian 2/2016 |
Latvian |
trial |
491,388,506 |
EUROPARL7, Latvian |
Latvian |
trial |
14,253,247 |
Latvian Web 2014 (lvTenTen14) |
Latvian |
trial |
657,522,048 |
DGT, Lithuanian |
Lithuanian |
main |
52,155,372 |
EUR-Lex judgments Lithuanian 12/2016 |
Lithuanian |
main |
21,558,688 |
Lithuanian Web (LithuanianWaC v2) |
Lithuanian |
main |
63,645,700 |
OPUS2 Lithuanian |
Lithuanian |
main |
40,933,573 |
EUR-Lex Lithuanian 2/2016 |
Lithuanian |
trial |
476,891,405 |
EUROPARL7, Lithuanian |
Lithuanian |
trial |
13,733,247 |
Lithuanian Web 2014 (ltTenTen14) |
Lithuanian |
trial |
981,517,649 |
OPUS2 Macedonian |
Macedonian |
trial |
49,066,513 |
Malayalam Web (malayalamWaC) |
Malayalam |
trial |
21,193,984 |
Malaysian Web (MalaysianWaC) |
Malay |
trial |
230,509,568 |
DGT, Maltese |
Maltese |
main |
30,172,433 |
EUR-Lex judgments Maltese 12/2016 |
Maltese |
main |
26,865,968 |
EUR-Lex Maltese 2/2016 |
Maltese |
trial |
466,854,303 |
Maltese MLRS Corpus |
Maltese |
trial |
125,267,653 |
Maori Web (MaoriWaC) |
Maori |
trial |
8,351,983 |
Mongolian Web Texts 2016 (mnWaC16) |
Mongolian |
trial |
7,540,919 |
Nepali Web (NepaliWaC) |
Nepali |
main |
1,464,492 |
Nepali National Corpus |
Nepali |
trial |
15,137,459 |
Corpus Nko ߒߞߏ ߝߊ߬ߘߌ߬ߞߋ߬ߟߋ߲߬ߡߊ |
N'Ko |
open |
4,636,227 |
CHILDES Norwegian Corpus |
Norwegian |
main |
61,075 |
Norwegian dictionary corpus (Nynorskkorpuset) |
Norwegian |
main |
87,228,361 |
OPUS2 Norwegian |
Norwegian |
main |
26,467,755 |
Norwegian Web 2015 (noTenTen15; Bokmål and Nynorsk) |
Norwegian |
trial |
1,953,892,201 |
Oromo WaC [2016] |
Oromo |
trial |
5,091,696 |
Kannada Web (KannadaWaC) |
Kannada |
main |
16,031,481 |
CHILDES Farsi Corpus |
Persian |
main |
150,505 |
TalkBank Persian (blog posts) |
Persian |
main |
549,165,952 |
OPUS2 Persian |
Persian |
trial |
5,367,401 |
BIBLE Polish-Swahili |
Polish |
main |
169,934 |
CHILDES Polish Corpus |
Polish |
main |
1,247,919 |
DGT, Polish |
Polish |
main |
58,520,395 |
EUR-Lex judgments Polish 12/2016 |
Polish |
main |
23,884,080 |
OPUS2 Polish |
Polish |
main |
285,188,755 |
Araneum Polonicum Maius [2013] |
Polish |
main |
1,110,120,694 |
Polish Web 2012 sample (plTenTen12) |
Polish |
main |
55,381,476 |
Polish Web (PolishWac) |
Polish |
main |
128,185,119 |
EUR-Lex Polish 2/2016 |
Polish |
trial |
510,957,144 |
EUROPARL7, Polish |
Polish |
trial |
15,171,493 |
Polish Web 2012 (plTenTen12) |
Polish |
trial |
9,387,142,186 |
Timestamped JSI web corpus 2014-2016 Polish |
Polish |
trial |
190,687,002 |
Brazilian Portuguese corpus (Corpus Brasileiro) |
Portuguese |
main |
1,133,416,757 |
CHILDES Portuguese Corpus |
Portuguese |
main |
245,805 |
DGT, Portuguese |
Portuguese |
main |
65,967,069 |
EUR-Lex judgments Portuguese 12/2016 |
Portuguese |
main |
44,247,824 |
OPUS2 Brazilian Portuguese |
Portuguese |
main |
355,049,778 |
OPUS2 Portuguese |
Portuguese |
main |
377,677,225 |
Newspapers in Portuguese (CetemPúblico, CetenFolha) |
Portuguese |
main |
66,319,147 |
Araneum Portugallicum Maius [2015] |
Portuguese |
main |
1,200,006,068 |
Portuguese Web 2011 sample (ptTenTen11, Freeling) |
Portuguese |
main |
44,446,042 |
Portuguese Web 2011 (ptTenTen11, Palavras parsed) |
Portuguese |
main |
3,245,834,337 |
EUR-Lex Portuguese 2/2016 |
Portuguese |
trial |
801,597,194 |
EUROPARL7, Portuguese |
Portuguese |
trial |
61,414,188 |
Timestamped JSI web corpus 2014-2016 Portuguese |
Portuguese |
trial |
1,312,377,855 |
Portuguese Web 2011 (ptTenTen11, Freeling v3) |
Portuguese |
trial |
4,626,584,246 |
DGT, Romanian |
Romanian |
main |
33,395,126 |
EUR-Lex judgments Romanian 12/2016 |
Romanian |
main |
22,055,262 |
OPUS2 Romanian |
Romanian |
main |
360,212,949 |
EUR-Lex Romanian 2/2016 |
Romanian |
trial |
461,819,855 |
EUROPARL7, Romanian |
Romanian |
trial |
10,795,858 |
Romanian Web (roWaC) |
Romanian |
trial |
53,457,522 |
Romanian Web 2016 (roTenTen16) |
Romanian |
trial |
3,142,636,172 |
CHILDES Russian Corpus |
Russian |
main |
59,759 |
OPUS2 Russian |
Russian |
main |
381,468,257 |
Araneum Russicum Maius (Russian, 15.02) 1,20 G |
Russian |
main |
1,200,001,911 |
Araneum Russicum Externum Maius (non-Russia Russian, 15.03) 1,20 G |
Russian |
main |
1,200,053,619 |
Araneum Russicum Maius [2013] |
Russian |
main |
1,216,800,424 |
ruSkELL 1.3 |
Russian |
main |
1,223,960,925 |
Russian web corpus |
Russian |
main |
187,965,822 |
Araneum Russicum Russicum Maius (Russia-only Russian, 15.03) 1,20 G |
Russian |
trial |
1,200,000,258 |
Timestamped JSI web corpus 2014-2016 Russian |
Russian |
trial |
1,402,853,056 |
Russian Web 2011 (ruTenTen11) |
Russian |
trial |
18,280,486,876 |
Samoan Web (SamoanWac1) |
Samoan |
trial |
3,583,362 |
Scottish Gaelic Wiki 2015 (gdWiki) |
Scottish Gaelic |
trial |
1,223,562 |
OPUS2 Serbian |
Serbian |
main |
198,141,613 |
Serbian Web 2014 (srWaC14) |
Serbian |
trial |
561,529,963 |
Timestamped JSI web corpus 2014-2016 Serbian |
Serbian |
trial |
100,816,398 |
Setswana/Tswana Web (SetswanaWaC v2) |
Setswana |
trial |
13,511,692 |
DGT, Slovak |
Slovak |
main |
56,095,893 |
EUR-Lex judgments Slovak 12/2016 |
Slovak |
main |
23,707,422 |
OPUS2 Slovak |
Slovak |
main |
82,952,296 |
Slovak Web 2011 (skTenTen11, ambiguity tag) |
Slovak |
main |
876,003,720 |
EUR-Lex Slovak 2/2016 |
Slovak |
trial |
366,709,333 |
EUROPARL7, Slovak |
Slovak |
trial |
15,042,066 |
Araneum Slovacum Maius [2013] |
Slovak |
trial |
1,200,005,746 |
Slovak Web 2011 (skTenTen11) |
Slovak |
trial |
656,067,998 |
DGT, Slovenian |
Slovenian |
main |
57,009,023 |
Lektor (Learner corpus of proofread and translations) |
Slovenian |
main |
1,244,028 |
EUR-Lex judgments Slovenian 12/2016 |
Slovenian |
main |
23,991,001 |
KAS-Dipl (diplome) |
Slovenian |
main |
713,212,210 |
KAS-Dr (doktorati) |
Slovenian |
main |
39,850,036 |
KAS-Mag (magisteriji) |
Slovenian |
main |
196,745,908 |
OPUS2 Slovenian |
Slovenian |
main |
163,160,520 |
EUR-Lex Slovenian 2/2016 |
Slovenian |
trial |
509,063,338 |
EUROPARL7, Slovenian |
Slovenian |
trial |
14,616,666 |
Slovenian reference corpus (FidaPLUS v2) |
Slovenian |
trial |
738,503,145 |
Slovenian Web 2015 (slTenTen15) |
Slovenian |
trial |
988,513,467 |
Somali WaC [2016] |
Somali |
trial |
79,741,231 |
CHILDES Spanish Corpus |
Spanish |
main |
1,358,475 |
DGT, Spanish |
Spanish |
main |
68,721,827 |
Araneum Hispanicum Maius [2013] |
Spanish |
main |
1,200,000,609 |
Spanish Web 2011 sample (esTenTen11, Eu + Am, Freeling v4) |
Spanish |
main |
73,597,801 |
EUR-Lex judgments Spanish 12/2016 |
Spanish |
main |
47,235,792 |
OPUS2 Spanish |
Spanish |
main |
870,615,999 |
Spanish Web corpus (SpanishWaC) |
Spanish |
main |
116,900,060 |
American Spanish Web 2011 (esamTenTen11) |
Spanish |
trial |
8,641,717,816 |
European Spanish Web 2011 (eseuTenTen11) |
Spanish |
trial |
2,343,829,757 |
Spanish Web 2011 (esTenTen11, Eu + Am) |
Spanish |
trial |
10,985,547,573 |
EUR-Lex Spanish 2/2016 |
Spanish |
trial |
811,673,158 |
EUROPARL7, Spanish |
Spanish |
trial |
60,862,330 |
Timestamped JSI web corpus 2014-2016 Spanish |
Spanish |
trial |
4,665,332,420 |
BIBLE Swahili-Polish |
Swahili |
main |
169,612 |
Swahili Web 2014 (SwahiliWaC) |
Swahili |
trial |
21,359,529 |
CHILDES Swedish Corpus |
Swedish |
main |
665,889 |
DGT, Swedish |
Swedish |
main |
55,407,291 |
EUR-Lex judgments Swedish 12/2016 |
Swedish |
main |
37,061,009 |
OPUS2 Swedish |
Swedish |
main |
128,245,911 |
SwedishParole |
Swedish |
main |
25,731,328 |
EUR-Lex Swedish 2/2016 |
Swedish |
trial |
640,815,888 |
EUROPARL7, Swedish |
Swedish |
trial |
51,759,122 |
Swedish Web 2014 (svTenTen14) |
Swedish |
trial |
3,900,846,988 |
Tajik Web (TajikWaC) |
Tajik |
trial |
109,805,133 |
CHILDES Tamil Corpus |
Tamil |
main |
21,865 |
Tamil Web 2015 (TamilWaC) |
Tamil |
trial |
32,861,569 |
Tatar Web 2015 sample |
Tatar |
trial |
290,351 |
Telugu Web (TeluguWaC) |
Telugu |
trial |
4,697,932 |
CHILDES Thai Corpus |
Thai |
main |
299,962 |
Thai Web (ThaiWaC) |
Thai |
trial |
108,013,897 |
Tibetan Corpus 2 |
Tibetan |
trial |
91,107,466 |
Tigrinya WaC [2016] |
Tigrinya |
trial |
2,531,443 |
CHILDES Turkish Corpus |
Turkish |
main |
233,097 |
OPUS2 Turkish |
Turkish |
main |
207,223,730 |
Turkish Web |
Turkish |
main |
40,539,507 |
Turkish Web 2012 (trTenTen12) |
Turkish |
trial |
4,124,558,200 |
Turkic web – Turkmen |
Turkmen |
trial |
2,536,935 |
OPUS2 Ukrainian |
Ukrainian |
main |
3,374,552 |
Ukrainian Web 2014 (uaTenTen14) |
Ukrainian |
trial |
2,734,851,744 |
Urdu Web (UrduWaC) |
Urdu |
trial |
60,808,847 |
Turkic web – Uzbek |
Uzbek |
trial |
24,570,516 |
Vietnamese Web (VietnameseWaC) |
Vietnamese |
trial |
129,781,089 |
Welsh web corpus |
Welsh |
main |
62,753,279 |
Welsh Web 2013 (WelshWaC) |
Welsh |
trial |
14,786,791 |
Yoruba Web 2015 (YorubaWaC15) |
Yoruba |
trial |
3,500,353 |