Overview of text corpora publicly available in Sketch Engine

A corpus counts as public if it is available to trial or paying subscribers, or if it can be reached through the open access interface. Sketch Engine also holds other corpora whose access is restricted, either because of copyright regulations or because they are owned and controlled by third parties.

Access policy categories

main – corpora available only for regular (paying) users

trial – corpora available for both trial and regular users

open – corpora available without registration
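
The three categories above nest: whatever an unregistered visitor can open is also open to trial and regular users, and regular users see everything a trial user sees. A minimal sketch of that rule follows. The names `ACCESS_BY_POLICY`, `can_access`, and the user labels `anonymous`, `trial`, `regular` are hypothetical and are not part of any Sketch Engine API. The table also uses the value "access on demand", which the legend does not define, so the sketch deliberately leaves it out.

```python
# Illustrative only: these names are assumptions, not a Sketch Engine API.
# Maps each access policy from the legend to the kinds of user it admits.
ACCESS_BY_POLICY = {
    "open": {"anonymous", "trial", "regular"},  # available without registration
    "trial": {"trial", "regular"},              # trial and regular users
    "main": {"regular"},                        # regular (paying) users only
}


def can_access(policy: str, user: str) -> bool:
    """Return True if a user of the given kind may use a corpus with this policy."""
    if policy not in ACCESS_BY_POLICY:
        # e.g. "access on demand", which the legend does not describe
        raise ValueError(f"policy not covered by the legend: {policy!r}")
    return user in ACCESS_BY_POLICY[policy]


if __name__ == "__main__":
    for policy in ("open", "trial", "main"):
        allowed = sorted(u for u in ("anonymous", "trial", "regular")
                         if can_access(policy, u))
        print(policy, "->", allowed)
```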


Name Language Access policy Size in words
[DEV] Norwegian Web 2017 (noTenTen17, Bokmål, DEV SAMPLE) Norwegian Bokmål trial 58,955,519
[DEV] Norwegian Web 2017 (noTenTen17, Nynorsk, DEV SAMPLE) Norwegian Nynorsk trial 58,743,828
[DEV] Swedish Web 2014 (svTenTen14) -- sample Swedish trial 45,477,881
ACL Anthology Reference Corpus (ARC) English open 62,196,334
Afrikaans Wikipedia corpus 2018 (afwiki) Afrikaans trial 14,466,792
American Spanish Web 2011 (esamTenTen11) Spanish trial 7,475,579,365
Amharic Web 2013-17 (amWaC17) Amharic trial 25,975,846
Arabic Learner Corpus (ALC) Arabic main 362,712
Arabic Web Arabic main 150,282,522
Arabic Web 2012 (arTenTen12, Stanford tagger) Arabic trial 7,475,624,779
Arabic Web 2012 sample 115M (arTenTen12, Mada tagger) Arabic main 115,315,274
Araneum Anglicum Africanum Maius [2015] English main 854,484,093
Araneum Anglicum Asiaticum Maius [2015] English main 867,259,037
Araneum Anglicum Maius [2015] English trial 888,466,066
Araneum Finnicum Maius [2014] Finnish main 817,453,523
Araneum Francogallicum Maius [2015] French main 933,688,995
Araneum Germanicum Maius [2013] German main 875,465,845
Araneum Hispanicum Maius [2013] Spanish main 892,299,770
Araneum Hungaricum Maius [2014] Hungarian trial 792,549,686
Araneum Italicum Maius (Italian, 14.12) 1,20 G Italian main 890,568,533
Araneum Nederlandicum Maius [2013] Dutch main 713,417,518
Araneum Polonicum Maius [2013] Polish main 595,768,667
Araneum Portugallicum Maius [2015] Portuguese main 862,134,902
Araneum Russicum Russicum Maius (Russia-only Russian, 15.03) 1,20 G Russian trial 859,319,823
Araneum Slovacum Maius [2013] Slovak trial 816,125,010
Basque Web (BasqueWaC v2) Basque trial 99,719,584
Belarusian Web 2016 (beTenTen16) Belarusian trial 63,327,264
Bengali Web (bnWaC) Bengali trial 11,519,730
BIBLE Polish-Swahili Polish main 138,216
BIBLE Swahili-Polish Swahili main 139,160
Boot Camp English English trial 85,683,246
Bosnian Web (bsWaC 1.2) Bosnian trial 248,478,730
Brazilian Portuguese corpus (Corpus Brasileiro) Portuguese main 871,117,178
Brexit corpus (English) English trial 108,452,923
Brexit corpus without retweets (English) English trial 4,789,571
British Academic Spoken English Corpus (BASE) English open 1,186,290
British Academic Written English Corpus (BAWE) English open 6,968,089
British Law Report Corpus English main 8,515,749
British National Corpus (BNC) English trial 96,134,547
British National Corpus (BNC), tagged by CLAWS English trial 96,052,598
British Web 2007 (ukWaC) English main 1,313,058,436
Brown English open 1,007,299
Brown Family English main 6,963,778
Brown Family, CLAWS + TreeTagger tags English main 6,975,474
Bulgarian National Corpus (BulgarianNC) Bulgarian main 20,975,703
Bulgarian National Corpus nonweb genres Bulgarian main 22,398,507
Bulgarian National Corpus with web Bulgarian main 419,512,059
Bulgarian Web 2012 (bgTenTen12, TreeTagger v2) Bulgarian trial 705,156,683
Cambridge Academic English English main 3,163,648
Cantonese Web (CantoneseWaC) Cantonese trial 30,898,663
Catalan Web 2014 (caTenTen14 v2) Catalan trial 182,691,653
Cebuano Web 2018 (cebTenTen18) Cebuano trial 4,552,105
CHILDES Afrikaans Corpus Afrikaans main 26,020
CHILDES Catalan Corpus Catalan main 209,525
CHILDES Croatian Corpus Croatian main 300,832
CHILDES Danish Corpus Danish main 285,231
CHILDES English Corpus English main 22,693,506
CHILDES Estonian Corpus Estonian main 313,457
CHILDES Farsi Corpus Persian main 120,527
CHILDES French Corpus French main 2,583,460
CHILDES Gaelic Corpus Irish main 16,848
CHILDES German Corpus German main 5,941,266
CHILDES Hebrew Corpus Hebrew main 807,657
CHILDES Hungarian Corpus Hungarian main 247,881
CHILDES Italian Corpus Italian main 459,881
CHILDES Japanese Corpus Japanese main 1,578,068
CHILDES Korean Corpus Korean main 36,056
CHILDES Norwegian Corpus Norwegian (Mixed) main 56,827
CHILDES Polish Corpus Polish main 1,041,300
CHILDES Portuguese Corpus Portuguese main 216,407
CHILDES Russian Corpus Russian main 48,791
CHILDES Spanish Corpus Spanish main 802,743
CHILDES Swedish Corpus Swedish main 520,478
CHILDES Tamil Corpus Tamil main 15,490
CHILDES Thai Corpus Thai main 243,939
CHILDES Turkish Corpus Turkish main 178,100
Chinese GigaWord 2 Corpus: Mainland, simplified Chinese Simplified main 205,031,379
Chinese GigaWord 2 Corpus: Taiwan, traditional Chinese Traditional main 382,600,557
Chinese Simplified Web 2017 sample Chinese Simplified trial 250,361,047
Chinese Traditional Web (TaiwanWaC) Chinese Traditional main 259,156,002
Chinese Traditional Web (TaiwanWaC, Universal Sketch Grammar) Chinese Traditional main 259,156,002
Chinese Traditional Web 2017 (zhTenTen17) sample Chinese Traditional trial 239,882,651
Chinese Web (Internet-ZH, NEUCSP tagger) Chinese Simplified main 198,205,344
Chinese Web 2011 (zhTenTen11, sample 10M) Chinese Simplified main 9,012,125
Chinese Web 2011 (zhTenTen11, Stanford tagger) Chinese Simplified trial 1,729,867,455
Chinese Web 2017 (zhTenTen17) Simplified Chinese Simplified trial 13,531,331,169
Chinese Web 2017 (zhTenTen17) Traditional Chinese Traditional trial 2,400,405,372
CoPEP - The Corpus of Portuguese from Academic Journals (v. 1.4) Portuguese main 40,423,011
Corpus Nko ߒߞߏ ߝߊ߬ߘߌ߬ߞߋ߬ߟߋ߲߬ߡߊ N'Ko open 4,102,593
Corpus of Academic Journal Articles (CAJA) English access on demand 79,107,410
Corpus of English Dialogues 1560–1760 English access on demand 1,151,171
Croatian Web (hrWaC 2.2, ReLDI) Croatian trial 1,210,021,198
Croatian Web (hrWaC 2.2, RFTagger) Croatian trial 1,211,328,660
csSkELL v1 (whole documents) Czech main 1,717,516,129
csSkELL v2.2 (sentences with GDEX scores) Czech main 1,443,410,941
Cundeelee Wangka Stories (Cundeelee Wangka) Cundeelee Wangka access on demand 1,965
Cundeelee Wangka Stories (English) English access on demand 4,423
Czech news and web 1995–2002 (czes2.2) Czech main 366,796,757
Czech Web 2017 (csTenTen17) Czech trial 10,502,222,474
Czech Web 2017 sample Czech trial 249,877,322
CzechParl 2012 (v2 with lempos) Czech main 37,184,025
Danish Web (DanishWaC) Danish main 288,272,967
Danish Web 2014 (daTenTen14) Danish main 2,040,976,501
Danish Web 2017 (daTenTen17) Danish trial 2,170,690,492
Danish Web 2017 sample Danish trial 214,447,970
DGT, Bulgarian Bulgarian main 25,912,721
DGT, Croatian Croatian main 3,968,608
DGT, Czech Czech main 43,621,933
DGT, Danish Danish main 44,962,280
DGT, Dutch Dutch main 50,523,892
DGT, English English main 59,106,576
DGT, Estonian Estonian main 34,155,488
DGT, Finnish Finnish main 35,129,923
DGT, French French main 58,224,781
DGT, German German main 45,380,666
DGT, Greek Greek main 51,865,988
DGT, Hungarian Hungarian main 2,306,272
DGT, Irish Irish main 1,065,421
DGT, Italian Italian main 53,260,912
DGT, Latvian Latvian main 38,898,134
DGT, Lithuanian Lithuanian main 38,675,242
DGT, Maltese Maltese main 22,388,562
DGT, Polish Polish main 44,149,107
DGT, Portuguese Portuguese main 53,950,705
DGT, Romanian Romanian main 26,644,734
DGT, Slovak Slovak main 43,276,048
DGT, Slovenian Slovenian main 42,897,385
DGT, Spanish Spanish main 57,311,149
DGT, Swedish Swedish main 44,378,725
Dutch Web 2014 (nlTenTen14) Dutch trial 2,253,777,579
Dutch Web 2014 (nlTenTen14, TreeTagger v2) Dutch main 2,538,714,434
Dutch Web 2014 sample Dutch trial 250,219,005
e-flux (International art English) English main 5,036,119
EcoLexicon English (Environment) English open 23,169,446
English Broadsheet Newspapers 1993–2013 (SiBol with trends) English main 654,435,535
English Corpus for SkELL 3.10 English main 1,038,200,313
English Corpus for SkELL 3.8 English main 1,041,772,774
English Corpus for SkELL 3.9 English main 1,041,138,575
English Historical Book Collection (EEBO, ECCO, Evans) English main 826,296,048
English Preposition Corpus English trial 2,136,325
English Web 2008 (enTenTen08) English main 2,759,340,513
English Web 2012 (enTenTen12) English main 11,191,860,036
English Web 2013 (enTenTen13) English trial 19,685,733,337
English Web 2013 sample English trial 204,976,089
English Web 2015 (enTenTen15) English trial 15,703,895,409
English Wikipedia English main 1,356,523,079
English Wikipedia sample with Error annotations English trial 951,824
Estonian National Corpus 2013 (Estonian NC 2013) Estonian main 463,827,780
Estonian National Corpus 2017 (Estonian NC 2017) Estonian main 1,107,584,469
Estonian Reference corpus 1990-2008 (EstonianRC) Estonian main 203,267,951
Estonian Web 2013 (etTenTen13) Estonian trial 260,559,829
EUR-Lex Bulgarian 2/2016 Bulgarian trial 329,071,554
EUR-Lex Croatian 2/2016 Croatian trial 109,138,184
EUR-Lex Czech 2/2016 Czech trial 350,230,088
EUR-Lex Danish 2/2016 Danish trial 519,765,085
EUR-Lex Dutch 2/2016 Dutch trial 583,263,688
EUR-Lex English 2/2016 English trial 629,722,593
EUR-Lex Estonian 2/2016 Estonian trial 291,077,511
EUR-Lex Finnish 2/2016 Finnish trial 384,119,975
EUR-Lex French 2/2016 French trial 677,063,993
EUR-Lex German 2/2016 German trial 528,617,843
EUR-Lex Greek 2/2016 Greek trial 579,344,223
EUR-Lex Hungarian 2/2016 Hungarian trial 340,618,970
EUR-Lex Irish 2/2016 Irish trial 31,439,542
EUR-Lex Italian 2/2016 Italian trial 606,070,097
EUR-Lex judgments Bulgarian 12/2016 Bulgarian trial 17,071,495
EUR-Lex judgments Croatian 12/2016 Croatian trial 5,613,468
EUR-Lex judgments Czech 12/2016 Czech trial 18,226,505
EUR-Lex judgments Danish 12/2016 Danish trial 34,934,021
EUR-Lex judgments Dutch 12/2016 Dutch trial 40,534,071
EUR-Lex judgments English 12/2016 English trial 42,339,337
EUR-Lex judgments Estonian 12/2016 Estonian trial 15,029,608
EUR-Lex judgments Finnish 12/2016 Finnish trial 23,601,422
EUR-Lex judgments French 12/2016 French trial 48,023,524
EUR-Lex judgments German 12/2016 German trial 35,297,517
EUR-Lex judgments Greek 12/2016 Greek trial 35,815,108
EUR-Lex judgments Hungarian 12/2016 Hungarian trial 17,940,879
EUR-Lex judgments Italian 12/2016 Italian trial 42,053,315
EUR-Lex judgments Latvian 12/2016 Latvian trial 16,908,831
EUR-Lex judgments Lithuanian 12/2016 Lithuanian trial 16,252,111
EUR-Lex judgments Maltese 12/2016 Maltese trial 19,146,797
EUR-Lex judgments Polish 12/2016 Polish trial 18,799,551
EUR-Lex judgments Portuguese 12/2016 Portuguese trial 35,412,936
EUR-Lex judgments Romanian 12/2016 Romanian trial 17,592,388
EUR-Lex judgments Slovak 12/2016 Slovak trial 18,265,664
EUR-Lex judgments Slovenian 12/2016 Slovenian trial 18,439,766
EUR-Lex judgments Spanish 12/2016 Spanish trial 39,431,836
EUR-Lex judgments Swedish 12/2016 Swedish trial 30,666,764
EUR-Lex Latvian 2/2016 Latvian trial 324,734,544
EUR-Lex Lithuanian 2/2016 Lithuanian trial 323,151,426
EUR-Lex Maltese 2/2016 Maltese trial 314,396,006
EUR-Lex Polish 2/2016 Polish trial 360,862,149
EUR-Lex Portuguese 2/2016 Portuguese trial 595,066,701
EUR-Lex Romanian 2/2016 Romanian trial 336,928,068
EUR-Lex Slovak 2/2016 Slovak trial 255,531,673
EUR-Lex Slovenian 2/2016 Slovenian trial 351,899,258
EUR-Lex Spanish 2/2016 Spanish trial 635,187,126
EUR-Lex Swedish 2/2016 Swedish trial 478,485,126
EUROPARL7, Bulgarian Bulgarian trial 9,215,233
EUROPARL7, Czech Czech trial 13,013,774
EUROPARL7, Danish Danish trial 48,343,860
EUROPARL7, Dutch Dutch trial 54,007,722
EUROPARL7, English English trial 53,837,625
EUROPARL7, Estonian Estonian trial 11,171,727
EUROPARL7, Finnish Finnish trial 34,182,031
EUROPARL7, French French trial 59,145,988
EUROPARL7, German German trial 47,805,055
EUROPARL7, Greek Greek trial 38,868,863
EUROPARL7, Hungarian Hungarian trial 12,421,715
EUROPARL7, Italian Italian trial 52,871,060
EUROPARL7, Latvian Latvian trial 11,920,085
EUROPARL7, Lithuanian Lithuanian trial 11,424,032
EUROPARL7, Polish Polish trial 13,034,164
EUROPARL7, Portuguese Portuguese trial 53,778,766
EUROPARL7, Romanian Romanian trial 9,554,864
EUROPARL7, Slovak Slovak trial 12,942,651
EUROPARL7, Slovenian Slovenian trial 12,496,942
EUROPARL7, Spanish Spanish trial 54,302,284
EUROPARL7, Swedish Swedish trial 46,303,799
European Spanish Web 2011 (eseuTenTen11) Spanish trial 2,021,633,644
Finnish Web 2014 (fiTenTen14) Finnish trial 1,404,083,812
Finnish Web 2014 (fiTenTen14, TreeTagger v2) Finnish main 1,404,100,049
Finnish Web 2014 sample (fiTenTen14, TreeTagger v2) Finnish trial 40,756,118
Frantext (French literature of the 18th-20th century) French main 15,573,070
Frantext (French literature of the 18th-20th century), without trends French main 15,573,070
French Web 2012 (frTenTen12) French trial 9,889,689,889
French Web 2012 (frTenTen12) sample French trial 86,447,827
French Web 2012 sample French trial 205,185,797
French Web 2017 (frTenTen17) French trial 5,752,261,039
French Web 2017 sample French trial 404,555,405
French web corpus (frWaC) French main 1,330,564,200
French web corpus (v2 with lempos) French main 104,705,211
Georgian Web 2013 (kaWaC) Georgian trial 50,713,604
German Corpus for SkELL 1.0 German main 769,810,745
German Web (deWaC) German main 1,348,188,416
German Web 2010 German main 2,338,036,362
German Web 2013 (deTenTen13) German trial 16,526,335,416
German Web 2013 sample German trial 193,838,751
GerManC (German Newspapers 1650-1800) German main 667,310
Gigafida v2.0 (referenčni) Slovenian main 1,109,441,592
Greek Web (GkWaC with lempos) Greek main 124,285,612
Greek Web 2014 (elTenTen14) Greek trial 1,671,692,845
Guangwai - Lancaster Chinese Learner Corpus Chinese Simplified open 1,289,060
Gujarati Web (guWaC) Gujarati trial 17,960,095
Hausa Web 2015 (hausaWaC15) Hausa (Boko) trial 5,304,300
Hebrew General Corpus (web crawled, mostly newspapers) Hebrew main 157,947,728
Hebrew Web (HebWaC) Hebrew main 47,832,254
Hebrew Web 2014 (heTenTen14, Meni/Alon tagged + lempos) Hebrew access on demand 895,876,116
Hebrew Web 2014 (heTenTen14, no POS tagging) Hebrew trial 890,282,843
Hindi Web (HindiWaC v. 4) Hindi trial 107,960,109
Hindi Web 2013 (hiTenTen13) Hindi main 351,289,441
Hungarian Web 2012 (huTenTen12) Hungarian trial 2,572,620,694
Icelandic texts [sample] Icelandic trial 5,436,035
Igbo Web 2015 (IgboWaC15) Igbo trial 331,042
Indonesian Web (IndonesianWaC) Indonesian trial 90,120,046
Irish Syllabic Poetry, circa 1200-1650 (BARDIC@TCD) Irish open 314,807
Italian Corpus for SkELL 1.0 Italian main 328,270,600
Italian Web (itWaC) Italian main 1,597,295,469
Italian Web 2010 (itTenTen) Italian main 2,588,873,046
Italian Web 2016 (itTenTen16) Italian trial 4,989,729,171
Italian Web 2016 sample Italian trial 201,204,942
itWAC (reduced) Italian main 751,542,948
Japanese Web (jpWaC) Japanese main 336,867,039
Japanese Web 2011 (jaTenTen11) Japanese trial 8,432,256,578
Japanese Web 2011 (jaTenTen11, sample) Japanese main 301,407,652
Japanese Web 2011 sample (jaTenTen11, LUW) Japanese trial 163,837,671
Kannada Web 2012 (knWaC12) Kannada trial 11,056,526
KAS-Dipl (diplome) Slovenian main 568,188,810
KAS-Dr (doktorati) Slovenian main 30,244,519
KAS-Mag (magisteriji) Slovenian main 157,168,378
Khmer Web 2018 (kmTenTen18) Khmer trial 16,500,379
Korean Web 2012 (koTenTen12) Korean main 461,196,240
Korean Web 2018 (koTenTen18) Korean trial 1,668,851,720
KSUCCA (Classical Arabic) Arabic main 46,705,577
Lao Web 2018 (loTenTen18) Lao trial 15,862,991
LatinISE historical corpus v2.2 Latin trial 11,036,900
Latvian Web (LatvianWaC) Latvian main 57,666,024
Latvian Web 2014 (lvTenTen14) Latvian trial 530,367,474
Lektor (Learner corpus of proofread and translations) Slovenian main 953,038
LEXMCI English main 1,448,180,339
Lithuanian Web (LithuanianWaC v2) Lithuanian main 48,650,918
Lithuanian Web 2014 (ltTenTen14) Lithuanian trial 778,151,979
Malayalam Web (malayalamWaC) Malayalam trial 15,950,663
Malaysian Web (MalaysianWaC) Malay trial 182,578,743
Maldivian Wikipedia corpus 2019 (dvwiki) Maldivian trial 548,211
Maltese MLRS Corpus Maltese trial 110,714,844
Maori Web (MaoriWaC) Maori trial 6,952,801
Medical Web Corpus English main 33,961,786
Mongolian Web Texts 2016 (mnWaC16) Mongolian trial 6,104,565
Multicultural London English Corpus English main 2,391,040
Nepali National Corpus Nepali trial 13,440,835
Nepali Web (NepaliWaC) Nepali main 1,290,388
New corpus for English (NCI English) English main 217,548,758
New Model Corpus English main 95,276,958
Newspapers in Portuguese (CetemPúblico, CetenFolha) Portuguese main 56,768,822
Norwegian dictionary corpus (Nynorskkorpuset) Norwegian (Mixed) main 74,496,664
Norwegian Web 2012 Norwegian (Mixed) main 669,511,569
Norwegian Web 2017 (noTenTen17, Bokmål) Norwegian Bokmål trial 2,472,483,911
Norwegian Web 2017 (noTenTen17, Nynorsk) Norwegian Nynorsk trial 174,830,652
OEC English access on demand 2,073,319,589
OEC v2 English access on demand 2,073,563,928
Open Access Journals (DOAJ - English) English trial 2,662,763,697
Open American National Corpus (spoken) English main 3,202,026
Open American National Corpus (written) English main 11,048,137
Open Cambridge Learner Corpus (Uncoded) English access on demand 2,975,701
Opus MontenegrinSubs: English English trial 468,337
Opus MontenegrinSubs: Montenegrin Montenegrin trial 365,698
OPUS2 Afrikaans Afrikaans main 586,334
OPUS2 Albanian Albanian trial 46,304,346
OPUS2 Arabic Arabic main 300,000,057
OPUS2 Bosnian Bosnian main 43,582,516
OPUS2 Brazilian Portuguese Portuguese main 272,300,927
OPUS2 Bulgarian Bulgarian main 183,115,244
OPUS2 Chinese Simplified Chinese Simplified main 243,427,123
OPUS2 Chinese Traditional Chinese Traditional main 380,245
OPUS2 Croatian Croatian main 121,369,625
OPUS2 Czech Czech main 203,845,619
OPUS2 Danish Danish main 120,107,271
OPUS2 Dutch Dutch main 356,363,571
OPUS2 English English main 1,139,515,048
OPUS2 Estonian Estonian main 64,879,741
OPUS2 Finnish Finnish main 131,985,872
OPUS2 French French main 766,833,908
OPUS2 German German main 125,229,773
OPUS2 Greek Greek main 239,360,926
OPUS2 Hebrew Hebrew main 130,972,343
OPUS2 Hindi Hindi main 854,741
OPUS2 Hungarian Hungarian main 157,495,018
OPUS2 Italian Italian main 180,532,849
OPUS2 Japanese Japanese main 5,455,106
OPUS2 Korean Korean main 374,850
OPUS2 Latvian Latvian main 24,499,516
OPUS2 Lithuanian Lithuanian main 29,621,940
OPUS2 Macedonian Macedonian trial 40,348,792
OPUS2 Norwegian Norwegian (Mixed) main 20,237,510
OPUS2 Persian Persian trial 4,425,133
OPUS2 Polish Polish main 208,008,636
OPUS2 Portuguese Portuguese main 297,700,205
OPUS2 Romanian Romanian main 282,408,295
OPUS2 Russian Russian main 307,709,872
OPUS2 Serbian Serbian main 153,237,786
OPUS2 Slovak Slovak main 62,451,407
OPUS2 Slovenian Slovenian main 121,228,966
OPUS2 Spanish Spanish main 111,497
OPUS2 Swedish Swedish main 102,298,686
OPUS2 Turkish Turkish main 151,342,424
OPUS2 Ukrainian Ukrainian main 2,578,289
Oromo Web 2016 (orWaC16) Oromo trial 4,249,953
Oxford Children's Corpus 2015 English access on demand 210,322,185
Oxford Children's Corpus 2015 -- Education English access on demand 1,323,174
Oxford Children's Corpus 2015 -- Reading English access on demand 34,284,687
Oxford Children's Corpus 2015 -- Writing English access on demand 174,714,324
Oxford Children's Corpus 2016 English access on demand 284,360,063
Oxford Children's Corpus 2016 -- Reading English access on demand 53,858,955
Oxford Children's Corpus 2016 -- Writing English access on demand 229,177,934
Oxford Corpus of Academic English (April 2012) English access on demand 71,372,972
Paisa Italian main 221,989,288
Parsed German Web (sDeWaC) German main 755,165,551
Penn Corpora of Historical English English access on demand 3,800,639
PICAE 2010 English access on demand 31,025,920
Polish Web (PolishWac, Morfeusz and TaKIPI tagger) Polish main 103,028,410
Polish Web 2012 (plTenTen12, RFTagger) Polish trial 7,715,835,214
Polish Web 2012 sample Polish trial 191,648,244
Portuguese Web 2011 (ptTenTen11) Portuguese trial 3,896,392,719
Portuguese Web 2011 (ptTenTen11, Palavras parsed) Portuguese main 2,757,635,105
Portuguese Web 2011 sample Portuguese trial 202,548,549
Project Gutenberg English English main 443,471,071
pukWaC (ukWaC parsed with MaltParser) English main 39,502,648
Quran annotated corpus [unvowelled Arabic] Arabic main 128,243
Quran annotated corpus [unvowelled Latin] Arabic main 99,268
Quran annotated corpus [vowelled Arabic] Arabic main 128,241
Quran annotated corpus [vowelled Latin] Arabic main 97,970
Riznica v0.1 Croatian main 85,273,724
Romanian Web 2016 (roTenTen16) Romanian trial 2,640,496,763
ruSkELL 1.6 Russian main 975,584,449
Russian Web 2011 (ruTenTen11) Russian trial 14,553,856,113
Russian Web 2011 sample (ruTenTen11) Russian trial 998,099,963
Russian web corpus (v2 with lempos) Russian main 147,930,261
Samoan Web (SamoanWac1) Samoan trial 3,115,385
ScienceBlogs English main 103,175,233
Scottish Gaelic Wiki 2015 (gdWiki) Scottish Gaelic trial 980,026
Semcor v3.0 (sense-tagged corpus) English main 664,038
Serbian Web (srWaC 1.2 processed by Hunpos) Serbian trial 477,724,164
Serbian Web (srWaC 1.2 processed by RFTagger v1) Serbian (Latin) trial 441,888,202
Serbian Web (srWaC 1.2) Serbian (Latin) trial 476,888,297
Setswana/Tswana Web (SetswanaWaC v2) Setswana trial 11,496,687
Slovak Web 2011 (skTenTen11) Slovak trial 540,112,634
Slovak Web 2011 (skTenTen11, ambiguity tag, lempos) Slovak main 715,707,053
Slovak Web 2011 sample Slovak trial 189,609,195
Slovenian reference corpus (FidaPLUS v2) Slovenian trial 600,309,670
Slovenian Web (slWaC 2.1 processed with TreeTagger v2) Slovenian trial 755,255,547
Slovenian Web (slWaC 2.1) Slovenian trial 754,255,589
Slovenian Web 2015 (slTenTen15, TreeTagger v2) Slovenian trial 829,544,337
Slovenian Web 2015 sample Slovenian trial 195,792,821
Somali Web 2016 (soWaC16) Somali trial 71,871,585
SoNaR Dutch access on demand 425,978,755
Spanish Web 2011 (esTenTen11, Eu + Am) Spanish trial 9,497,213,009
Spanish Web 2011 sample Spanish trial 212,142,794
Spanish Web 2018 (esTenTen18) Spanish trial 17,553,075,259
Spanish Web 2018 sample Spanish trial 177,257,648
Spanish Web corpus (SpanishWaC) Spanish main 97,773,185
Susanne English trial 128,998
Swahili Web 2014 (SwahiliWaC) Swahili trial 17,882,483
Swedish Web 2014 (svTenTen14) Swedish trial 3,401,035,817
SwedishParole Swedish main 21,735,113
Tagalog (Filipino) Web 2019 (tlTenTen19) Tagalog trial 197,908,842
Tajik Web (TajikWaC) Tajik trial 93,151,897
TalkBank Persian (blog posts) Persian main 474,773,547
Tamil Web 2015 (TamilWaC) Tamil trial 26,750,515
Tatar Mixed Corpus Tatar trial 102,779,803
Tatar News (2000-2014), version with lempos Tatar main 24,927,439
Tatar Web 2015 sample Tatar trial 195,901
Ted Talks transcripts English main 2,882,085
Telugu Web (TeluguWaC) Telugu trial 3,691,203
Thai Web (ThaiWaC) Thai trial 82,787,119
Thai Web 2018 (thTenTen18) Thai trial 640,530,227
The New Corpus for Ireland Irish main 29,886,201
Tibetan Corpus 2 Tibetan trial 80,613,567
Tigrinya Web 2016 (tiWaC16) Tigrinya trial 2,087,613
Timestamped JSI web corpus 2014-2016 Arabic Arabic trial 976,573,611
Timestamped JSI web corpus 2014-2016 Catalan Catalan trial 99,395,494
Timestamped JSI web corpus 2014-2016 Czech Czech trial 289,488,005
Timestamped JSI web corpus 2014-2016 Dutch Dutch trial 401,347,934
Timestamped JSI web corpus 2014-2016 English English trial 18,315,071,361
Timestamped JSI web corpus 2014-2016 Finnish Finnish trial 119,109,490
Timestamped JSI web corpus 2014-2016 French French trial 1,870,341,756
Timestamped JSI web corpus 2014-2016 German German trial 1,987,759,563
Timestamped JSI web corpus 2014-2016 Hebrew Hebrew trial 111,339,363
Timestamped JSI web corpus 2014-2016 Hungarian Hungarian trial 180,843,359
Timestamped JSI web corpus 2014-2016 Italian Italian trial 1,375,907,374
Timestamped JSI web corpus 2014-2016 Korean Korean trial 438,816,127
Timestamped JSI web corpus 2014-2016 Polish Polish trial 157,930,228
Timestamped JSI web corpus 2014-2016 Portuguese Portuguese trial 1,109,771,393
Timestamped JSI web corpus 2014-2016 Russian Russian trial 1,120,731,416
Timestamped JSI web corpus 2014-2016 Serbian Serbian trial 86,380,673
Timestamped JSI web corpus 2014-2016 Spanish Spanish trial 4,055,944,612
Timestamped JSI web corpus 2014-2016 Swedish Swedish trial 335,782,681
Timestamped JSI web corpus 2014-2019 Arabic Arabic main 2,852,498,557
Timestamped JSI web corpus 2014-2019 Catalan Catalan main 256,092,552
Timestamped JSI web corpus 2014-2019 Czech Czech main 610,667,197
Timestamped JSI web corpus 2014-2019 Dutch Dutch main 831,815,576
Timestamped JSI web corpus 2014-2019 English English main 40,669,324,757
Timestamped JSI web corpus 2014-2019 Finnish Finnish main 270,941,853
Timestamped JSI web corpus 2014-2019 French French main 4,353,947,792
Timestamped JSI web corpus 2014-2019 German German main 4,617,996,199
Timestamped JSI web corpus 2014-2019 Hebrew Hebrew main 293,090,716
Timestamped JSI web corpus 2014-2019 Hungarian Hungarian main 463,605,660
Timestamped JSI web corpus 2014-2019 Italian Italian main 3,648,950,969
Timestamped JSI web corpus 2014-2019 Korean Korean main 1,125,598,593
Timestamped JSI web corpus 2014-2019 Polish Polish main 421,121,286
Timestamped JSI web corpus 2014-2019 Portuguese Portuguese main 2,763,671,247
Timestamped JSI web corpus 2014-2019 Russian Russian main 3,196,159,370
Timestamped JSI web corpus 2014-2019 Serbian Serbian main 273,114,920
Timestamped JSI web corpus 2014-2019 Spanish Spanish main 8,287,688,488
Timestamped JSI web corpus 2014-2019 Swedish Swedish main 758,152,982
Timestamped JSI web corpus 2019-08 Arabic Arabic main 71,880,377
Timestamped JSI web corpus 2019-08 Catalan Catalan main 6,022,427
Timestamped JSI web corpus 2019-08 Czech Czech main 13,581,868
Timestamped JSI web corpus 2019-08 Dutch Dutch main 16,819,674
Timestamped JSI web corpus 2019-08 English English main 792,186,999
Timestamped JSI web corpus 2019-08 Finnish Finnish main 6,428,819
Timestamped JSI web corpus 2019-08 French French main 83,292,039
Timestamped JSI web corpus 2019-08 German German main 102,972,493
Timestamped JSI web corpus 2019-08 Hebrew Hebrew main 6,777,040
Timestamped JSI web corpus 2019-08 Hungarian Hungarian main 12,121,988
Timestamped JSI web corpus 2019-08 Italian Italian main 87,604,864
Timestamped JSI web corpus 2019-08 Korean Korean main 25,441,203
Timestamped JSI web corpus 2019-08 Polish Polish main 12,432,468
Timestamped JSI web corpus 2019-08 Portuguese Portuguese main 71,413,958
Timestamped JSI web corpus 2019-08 Russian Russian main 92,818,842
Timestamped JSI web corpus 2019-08 Serbian Serbian main 8,578,457
Timestamped JSI web corpus 2019-08 Spanish Spanish main 231,733,070
Timestamped JSI web corpus 2019-08 Swedish Swedish main 14,627,364
Timestamped JSI web corpus 2019-09 Arabic Arabic main 24,980,593
Timestamped JSI web corpus 2019-09 Catalan Catalan main 2,005,660
Timestamped JSI web corpus 2019-09 Czech Czech main 4,579,783
Timestamped JSI web corpus 2019-09 Dutch Dutch main 6,239,465
Timestamped JSI web corpus 2019-09 English English main 244,021,560
Timestamped JSI web corpus 2019-09 Finnish Finnish main 1,987,048
Timestamped JSI web corpus 2019-09 French French main 31,573,279
Timestamped JSI web corpus 2019-09 German German main 33,152,178
Timestamped JSI web corpus 2019-09 Hebrew Hebrew main 2,109,592
Timestamped JSI web corpus 2019-09 Hungarian Hungarian main 4,391,952
Timestamped JSI web corpus 2019-09 Italian Italian main 44,382,709
Timestamped JSI web corpus 2019-09 Korean Korean main 8,413,661
Timestamped JSI web corpus 2019-09 Polish Polish main 5,035,813
Timestamped JSI web corpus 2019-09 Portuguese Portuguese main 22,212,645
Timestamped JSI web corpus 2019-09 Russian Russian main 30,183,868
Timestamped JSI web corpus 2019-09 Serbian Serbian main 2,389,462
Timestamped JSI web corpus 2019-09 Spanish Spanish main 75,717,777
Timestamped JSI web corpus 2019-09 Swedish Swedish main 5,307,950
Turkic web – Azerbaijani Azerbaijani trial 94,267,206
Turkic web – Kazakh Kazakh trial 139,417,763
Turkic web – Kyrgyz Kyrgyz trial 19,369,507
Turkic web – Turkmen Turkmen trial 2,105,359
Turkic web – Uzbek Uzbek trial 18,720,334
Turkish Web (trWaC) Turkish main 32,791,491
Turkish Web 2012 (trTenTen12) Turkish trial 3,388,418,900
Ukrainian Web 2014 (ukTenTen14) Ukrainian trial 2,194,447,594
UKWaC super sensed English main 315,402,632
Urdu Web (UrduWaC) Urdu trial 53,269,273
Vietnamese Web (VietnameseWaC) Vietnamese trial 106,464,835
Welsh Web 2013 (WelshWaC) Welsh trial 12,458,397
Welsh web corpus Welsh main 50,392,441
Western Frisian Web 2013 (FrisianWaC) Frisian trial 3,116,119
Yiddish Wikipedia corpus 2018 (yiwiki) Yiddish trial 2,106,912
Yoruba Web 2015 (YorubaWaC15) Yoruba trial 2,816,965
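
Each row of the table above is a single line of text with the columns separated only by spaces, so the corpus name and the language cannot always be split reliably (both may contain several words). The access policy and the size, however, can be read from the right-hand end of the line. The sketch below does that and sums sizes per access policy. It assumes the rows are available as plain strings exactly as listed here; `parse_row` and `total_words_by_policy` are hypothetical helper names introduced for illustration.

```python
import re
from collections import Counter

# Access policy values that occur in the table, followed by the size in words.
ROW = re.compile(
    r"^(?P<rest>.+) "
    r"(?P<access>main|trial|open|access on demand) "
    r"(?P<size>\d{1,3}(?:,\d{3})*)$"
)


def parse_row(line: str):
    """Return (name_and_language, access_policy, size_in_words) or None.

    The name and language stay joined because the flattened text gives no
    unambiguous boundary between them.
    """
    match = ROW.match(line.strip())
    if match is None:
        return None  # header line, blank line, or unexpected layout
    size = int(match["size"].replace(",", ""))
    return match["rest"], match["access"], size


def total_words_by_policy(lines):
    """Sum the 'Size in words' column for each access policy."""
    totals = Counter()
    for line in lines:
        parsed = parse_row(line)
        if parsed is not None:
            _, access, size = parsed
            totals[access] += size
    return totals


if __name__ == "__main__":
    sample = [
        "Name Language Access policy Size in words",
        "Brown English open 1,007,299",
        "British Academic Spoken English Corpus (BASE) English open 1,186,290",
        "Susanne English trial 128,998",
        "Corpus of Academic Journal Articles (CAJA) English access on demand 79,107,410",
    ]
    for policy, words in total_words_by_policy(sample).items():
        print(f"{policy}: {words:,}")
```

Matching from the end of the line keeps the parse robust for names that themselves contain commas, digits, or the word "Open", such as "Open American National Corpus (spoken)".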