Reference corpora are used in keyword and term extraction. The corpus from which keywords are extracted (the focus corpus) is compared with a general reference corpus, so that words and terms that are unusually frequent in the focus corpus can be identified correctly. The following is a list of the default reference corpora in Sketch Engine, with their sizes in words. Where a separate reference corpus is used for term extraction, it is listed as well. A different reference corpus can be selected in the Keywords and Terms settings.

| Language | Default reference corpus | Words | Reference corpus for terms | Words |
|---|---|---|---|---|
| Afrikaans | Afrikaans Wikipedia corpus 2018 (afwiki) | 14,466,792 | | |
| Albanian | OPUS2 Albanian | 46,304,346 | | |
| Amharic | Amharic Web 2013-17 (amWaC17) | 25,975,846 | | |
| Arabic | Arabic Web 2012 (arTenTen12, Stanford tagger) | 7,475,624,779 | | |
| Azerbaijani | Turkic web – Azerbaijani | 94,267,206 | | |
| Basque | Basque Web (BasqueWaC v2) | 99,719,584 | | |
| Belarusian | Belarusian Web 2016 (beTenTen16) | 63,327,264 | | |
| Bengali | Bengali Web (bnWaC) | 11,519,730 | | |
| Bosnian | Bosnian Web (bsWaC 1.2) | 248,478,730 | | |
| Bulgarian | Bulgarian Web 2012 (bgTenTen12, TreeTagger v2) | 705,156,683 | | |
| Cantonese | Cantonese Web (CantoneseWaC) | 30,898,663 | | |
| Catalan | Catalan Web 2014 (caTenTen14 v2) | 182,691,653 | | |
| Chinese Simplified | Chinese Web 2017 (zhTenTen17) Simplified | 13,531,331,169 | Chinese Simplified Web 2017 sample | 250,361,047 |
| Chinese Traditional | Chinese Web 2017 (zhTenTen17) Traditional | 2,400,405,372 | Chinese Traditional Web 2017 (zhTenTen17) sample | 239,882,651 |
| Croatian | Croatian Web (hrWaC 2.2, RFTagger) | 1,211,328,660 | Croatian Web (hrWaC 2.2, RFTagger) | 1,211,328,660 |
| Czech | Czech Web 2017 (csTenTen17) | 10,502,222,474 | Czech Web 2017 sample | 249,877,322 |
| Danish | Danish Web 2017 (daTenTen17) | 2,170,690,492 | Danish Web 2017 sample | 214,447,970 |
| Dutch | Dutch Web 2014 (nlTenTen14) | 2,253,777,579 | Dutch Web 2014 (nlTenTen14) | 2,253,777,579 |
| English | English Web 2015 (enTenTen15) | 13,190,556,334 | English Web 2015 (enTenTen15) | 13,190,556,334 |
| Estonian | Estonian National Corpus 2019 (Estonian NC 2019) | 1,500,284,681 | Estonian National Corpus 2019 (Estonian NC 2019) | 1,500,284,681 |
| Filipino | Filipino Web (FilipinoWaC) | 26,991,049 | | |
| Finnish | Finnish Web 2014 (fiTenTen14) | 1,404,083,812 | Finnish Web 2014 (fiTenTen14) | 1,404,083,812 |
| French | French Web 2017 (frTenTen17) | 5,752,261,039 | French Web 2017 sample | 404,555,405 |
| Frisian | Western Frisian Web 2013 (FrisianWaC) | 3,116,119 | | |
| Georgian | Georgian Web 2013 (kaWaC) | 50,713,604 | | |
| German | German Web 2013 (deTenTen13) | 16,526,335,416 | German Web 2013 sample | 193,838,751 |
| Greek | Greek Web 2014 (elTenTen14) | 1,671,692,845 | | |
| Gujarati | Gujarati Web (guWaC) | 17,960,095 | | |
| Hausa (Boko) | Hausa Web 2015 (hausaWaC15) | 5,304,300 | | |
| Hebrew | Hebrew Web 2014 (heTenTen14, no POS tagging) | 890,282,843 | | |
| Hindi | Hindi Web 2012 (HindiWaC v. 4) | 107,960,109 | | |
| Hungarian | Hungarian Web 2012 (huTenTen12) | 2,572,620,694 | | |
| Icelandic | Icelandic texts [sample] | 5,436,035 | | |
| Igbo | Igbo Web 2015 (IgboWaC15) | 331,042 | | |
| Indonesian | Indonesian Web (IndonesianWaC) | 90,120,046 | | |
| Irish | New Corpus for Ireland (NCI Irish) | 29,886,201 | | |
| Italian | Italian Web 2016 (itTenTen16) | 4,989,729,171 | Italian Web 2016 sample | 201,204,942 |
| Japanese | Japanese Web 2011 sample (jaTenTen11, LUW) | 163,837,671 | Japanese Web 2011 sample (jaTenTen11, LUW) | 163,837,671 |
| Kannada | Kannada Web 2012 (knWaC12) | 11,056,526 | | |
| Kazakh | Turkic web – Kazakh | 139,417,763 | | |
| Khmer | Khmer Web 2018 (kmTenTen18) | 16,500,379 | | |
| Korean | Korean Web 2018 (koTenTen18) | 1,668,851,720 | Korean 2018 term reference corpus (koTenTen18_term_ref) | 83,749,660 |
| Kyrgyz | Turkic web – Kyrgyz | 19,369,507 | | |
| Lao | Lao Web 2018 (loTenTen18) | 15,862,991 | | |
| Latin | LatinISE historical corpus v2.2 | 11,036,900 | | |
| Latvian | Latvian Web 2014 (lvTenTen14) | 530,367,474 | | |
| Lithuanian | Lithuanian Web 2014 (ltTenTen14) | 778,151,979 | | |
| Macedonian | OPUS2 Macedonian | 40,348,792 | | |
| Malay | Malaysian Web (MalaysianWaC) | 182,578,743 | | |
| Malayalam | Malayalam Web (malayalamWaC) | 15,950,663 | | |
| Maltese | Maltese MLRS Corpus | 110,714,844 | | |
| Maori | Maori Web (MaoriWaC) | 6,952,801 | Maori Web (MaoriWaC) | 6,952,801 |
| Nepali | Nepali National Corpus | 13,440,835 | | |
| Norwegian (Mixed) | Norwegian Web 2017 (noTenTen17, Bokmål) | 2,472,483,911 | Norwegian Web 2017 sample (Bokmål) | 58,955,519 |
| Norwegian Bokmål | Norwegian Web 2017 (noTenTen17, Bokmål) | 2,472,483,911 | Norwegian Web 2017 sample (Bokmål) | 58,955,519 |
| Norwegian Nynorsk | Norwegian Web 2017 (noTenTen17, Nynorsk) | 174,830,652 | Norwegian Web 2017 sample (Nynorsk) | 58,743,828 |
| Oromo | Oromo Web 2016 (orWaC16) | 4,249,953 | | |
| Persian | OPUS2 Persian | 4,425,133 | | |
| Polish | Polish Web 2012 (plTenTen12, RFTagger) | 7,715,835,214 | Polish Web 2012 sample | 191,648,244 |
| Portuguese | Portuguese Web 2011 (old word sketches) | 3,896,392,719 | Portuguese Web 2011 sample | 202,548,549 |
| Romanian | Romanian Web 2016 (roTenTen16) | 2,640,496,763 | | |
| Russian | Russian Web 2011 (ruTenTen11) | 14,553,856,113 | Russian Web 2011 sample (ruTenTen11) | 998,099,963 |
| Samoan | Samoan Web (SamoanWac1) | 3,115,385 | | |
| Scottish Gaelic | Scottish Gaelic Wiki 2015 (gdWiki) | 980,026 | | |
| Serbian | Serbian Web (srWaC 1.2 processed by Hunpos) | 477,724,164 | | |
| Serbian (Latin) | Serbian Web (srWaC 1.2 processed by RFTagger v1) | 441,888,202 | Serbian Web (srWaC 1.2 processed by RFTagger v1) | 441,888,202 |
| Setswana | Setswana/Tswana Web (SetswanaWaC v2) | 11,496,687 | | |
| Slovak | Slovak Web 2011 (skTenTen11) | 540,112,634 | Slovak Web 2011 sample | 189,609,195 |
| Slovenian | Slovenian Web 2015 (slTenTen15, TreeTagger v2) | 829,544,337 | Slovenian Web 2015 sample | 195,792,821 |
| Somali | Somali Web 2016 (soWaC16) | 71,871,585 | | |
| Spanish | Spanish Web 2018 (esTenTen18) | 17,553,075,259 | Spanish Web 2018 sample | 177,257,648 |
| Swahili | Swahili Web 2014 (SwahiliWaC) | 17,882,483 | | |
| Swedish | Swedish Web 2014 (svTenTen14) | 3,401,035,817 | Swedish Web 2014 sample | 45,477,881 |
| Tagalog | Tagalog (Filipino) Web 2018 (tlTenTen18) | 151,164,040 | | |
| Tajik | Tajik Web (TajikWaC) | 93,151,897 | | |
| Tamil | Tamil Web 2015 (TamilWaC) | 26,750,515 | | |
| Tatar | Tatar Mixed Corpus | 102,779,803 | | |
| Telugu | Telugu Web (TeluguWaC) | 3,691,203 | | |
| Thai | Thai Web (ThaiWaC) | 82,787,119 | | |
| Tibetan | Tibetan Corpus 2 | 80,613,567 | | |
| Tigrinya | Tigrinya Web 2016 (tiWaC16) | 2,087,613 | | |
| Turkish | Turkish Web 2012 (trTenTen12) | 3,388,418,900 | | |
| Turkmen | Turkic web – Turkmen | 2,105,359 | | |
| Ukrainian | Ukrainian Web 2014 (ukTenTen14) | 2,194,447,594 | | |
| Urdu | Urdu Web (UrduWaC) | 53,269,273 | | |
| Uzbek | Turkic web – Uzbek | 18,720,334 | | |
| Vietnamese | Vietnamese Web (VietnameseWaC) | 106,464,835 | | |
| Welsh | Welsh Web 2013 (WelshWaC) | 12,458,397 | | |
| Yoruba | Yoruba Web 2015 (YorubaWaC15) | 2,816,965 | | |
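
To illustrate why keyword extraction needs a reference corpus at all, here is a minimal sketch of a keyness calculation. It assumes the "simple maths" style of score, in which a word's frequency per million in the focus corpus is divided by its frequency per million in the reference corpus, with a smoothing constant added to both. The function names, the toy texts, and the smoothing value `n=1.0` are illustrative assumptions. This is not Sketch Engine's API, and the corpora in the list above are far larger than these samples.

```python
from collections import Counter


def per_million(count, corpus_size):
    """Normalise a raw count to a frequency per million tokens."""
    return count * 1_000_000 / corpus_size


def keyness(focus_counts, focus_size, ref_counts, ref_size, n=1.0):
    """Score each focus-corpus word against the reference corpus.

    score = (fpm_focus + n) / (fpm_ref + n)
    n is a smoothing constant (assumed here to be 1.0); it also avoids
    division by zero for words missing from the reference corpus.
    """
    scores = {}
    for word, count in focus_counts.items():
        fpm_focus = per_million(count, focus_size)
        fpm_ref = per_million(ref_counts.get(word, 0), ref_size)
        scores[word] = (fpm_focus + n) / (fpm_ref + n)
    return scores


# Toy data standing in for a focus corpus and a general reference corpus.
focus = Counter("the corpus tagger reads the corpus".split())
reference = Counter("the cat sat on the mat and the dog saw the cat".split())

scores = keyness(focus, sum(focus.values()), reference, sum(reference.values()))

# Words ranked from most to least typical of the focus corpus.
top = sorted(scores, key=scores.get, reverse=True)
print(top)
```

In this toy example a common word such as "the" has the same relative frequency in both texts, so it scores about 1, while words that appear only in the focus text rise to the top. A larger and more general reference corpus gives more reliable frequencies for ordinary words, so they are less likely to be mistaken for keywords.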