Reference corpora are used in keyword and term extraction. To identify keywords and terms correctly, the corpus they are extracted from (the focus corpus) is compared to a general corpus, which serves as the reference corpus. The table below lists the default reference corpus for each language in Sketch Engine. A different reference corpus can be selected in the Keywords and Terms settings.

| Language | Default reference corpus | Words |
| --- | --- | --- |
| Afrikaans | Afrikaans Wikipedia corpus 2018 (afwiki) | 14,466,792 |
| Albanian | OPUS2 Albanian | 46,304,346 |
| Amharic | Amharic Web 2013-17 (amWaC17) | 25,975,846 |
| Arabic | Arabic Web 2012 (arTenTen12, Stanford tagger) | 7,475,624,779 |
| Azerbaijani | Turkic web – Azerbaijani | 94,267,206 |
| Basque | Basque Web (BasqueWaC v2) | 99,719,584 |
| Belarusian | Belarusian Web 2016 (beTenTen16) | 63,327,264 |
| Bengali | Bengali Web (bnWaC) | 11,519,730 |
| Bosnian | Bosnian Web (bsWaC 1.2) | 248,478,730 |
| Bulgarian | Bulgarian Web 2012 (bgTenTen12, TreeTagger v2) | 705,156,683 |
| Cantonese | Cantonese Web (CantoneseWaC) | 30,898,663 |
| Catalan | Catalan Web 2014 (caTenTen14 v2) | 182,691,653 |
| Chinese Simplified | Chinese Web 2017 (zhTenTen17) Simplified | 13,531,331,169 |
| Chinese Traditional | Chinese Web 2017 (zhTenTen17) Traditional | 2,400,405,372 |
| Croatian | Croatian Web (hrWaC 2.2, RFTagger) | 1,211,328,660 |
| Czech | Czech Web 2017 (csTenTen17) | 10,502,222,474 |
| Danish | Danish Web 2020 (daTenTen20) | 3,480,275,804 |
| Dutch | Dutch Web 2014 (nlTenTen14) | 2,253,777,579 |
| English | English Web 2018 (enTenTen18) | 21,926,740,748 |
| Estonian | Estonian Web 2019 (etTenTen19) | 508,447,009 |
| Filipino | Filipino Web (FilipinoWaC) | 26,991,049 |
| Finnish | Finnish Web 2014 (fiTenTen14) | 1,404,083,812 |
| French | French Web 2017 (frTenTen17) | 5,752,261,039 |
| Frisian | Western Frisian Web 2013 (FrisianWaC) | 3,116,119 |
| Georgian | Georgian Web 2013 (kaWaC) | 50,713,604 |
| German | German Web 2013 (deTenTen13) | 16,526,335,416 |
| Greek | Greek Web 2014 (elTenTen14) | 1,671,692,845 |
| Gujarati | Gujarati Web (guWaC) | 17,960,095 |
| Hausa (Boko) | Hausa Web 2015 (hausaWaC15) | 5,304,300 |
| Hebrew | Hebrew Web 2014 (heTenTen14, no POS tagging) | 890,282,843 |
| Hindi | Hindi Web 2021 (hiTenTen21) | 1,666,964,163 |
| Hungarian | Hungarian Web 2012 (huTenTen12) | 2,572,620,694 |
| Icelandic | Icelandic texts [sample] | 5,436,035 |
| Igbo | Igbo Web 2015 (IgboWaC15) | 331,042 |
| Indonesian | Indonesian Web (IndonesianWaC) | 90,120,046 |
| Irish | New Corpus for Ireland (NCI Irish) | 29,886,201 |
| Italian | Italian Web 2016 (itTenTen16) | 4,989,729,171 |
| Japanese | Japanese Web 2011 sample (jaTenTen11, LUW) | 163,837,671 |
| Kannada | Kannada Web 2012 (knWaC12) | 11,056,526 |
| Kazakh | Turkic web – Kazakh | 139,417,763 |
| Khmer | Khmer Web 2018 (kmTenTen18) | 16,500,379 |
| Korean | Korean Web 2018 (koTenTen18) | 1,668,851,720 |
| Kyrgyz | Turkic web – Kyrgyz | 19,369,507 |
| Lao | Lao Web 2018 (loTenTen18) | 15,862,991 |
| Latin | LatinISE historical corpus v2.2 | 11,036,900 |
| Latvian | Latvian Web 2014 (lvTenTen14) | 530,367,474 |
| Lithuanian | Lithuanian Web 2014 (ltTenTen14) | 778,151,979 |
| Macedonian | OPUS2 Macedonian | 40,348,792 |
| Malay | Malaysian Web (MalaysianWaC) | 182,578,743 |
| Malayalam | Malayalam Web (malayalamWaC) | 15,950,663 |
| Maltese | Maltese MLRS Corpus | 110,714,844 |
| Maori | Maori Web (MaoriWaC) | 6,952,801 |
| Nepali | Nepali National Corpus | 13,440,835 |
| Norwegian (Mixed) | Norwegian Web 2017 (noTenTen17, Bokmål) | 2,472,483,911 |
| Norwegian Bokmål | Norwegian Web 2017 (noTenTen17, Bokmål) | 2,472,483,911 |
| Norwegian Nynorsk | Norwegian Web 2017 (noTenTen17, Nynorsk) | 174,830,652 |
| Oromo | Oromo Web 2016 (orWaC16) | 4,249,953 |
| Persian | OPUS2 Persian | 4,425,133 |
| Polish | Polish Web 2012 (plTenTen12, RFTagger) | 7,715,835,214 |
| Portuguese | Portuguese Web 2011 (ptTenTen11) | 3,896,392,719 |
| Romanian | Romanian Web 2016 (roTenTen16) | 2,640,496,763 |
| Russian | Russian Web 2011 (ruTenTen11) | 14,553,856,113 |
| Samoan | Samoan Web (SamoanWac1) | 3,115,385 |
| Scottish Gaelic | Scottish Gaelic Wiki 2015 (gdWiki) | 980,026 |
| Serbian | Serbian Web (srWaC 1.2 processed by Hunpos) | 477,724,164 |
| Serbian (Latin) | Serbian Web (srWaC 1.2 processed by RFTagger v1) | 441,888,202 |
| Setswana | Setswana/Tswana Web (SetswanaWaC v2) | 11,496,687 |
| Slovak | Slovak Web 2011 (skTenTen11) | 540,112,634 |
| Slovenian | Slovenian Web 2015 (slTenTen15, TreeTagger v2) | 829,544,337 |
| Somali | Somali Web 2016 (soWaC16) | 71,871,585 |
| Spanish | Spanish Web 2018 (esTenTen18) | 17,553,075,259 |
| Swahili | Swahili Web 2014 (SwahiliWaC) | 17,882,483 |
| Swedish | Swedish Web 2014 (svTenTen14) | 3,401,035,817 |
| Tagalog | Tagalog (Filipino) Web 2018 (tlTenTen18) | 151,164,040 |
| Tajik | Tajik Web (TajikWaC) | 93,151,897 |
| Tamil | Tamil Web 2015 (TamilWaC) | 26,750,515 |
| Tatar | Tatar Mixed Corpus | 102,779,803 |
| Telugu | Telugu Web (TeluguWaC) | 3,691,203 |
| Thai | Thai Web (ThaiWaC) | 82,787,119 |
| Tibetan | Tibetan Corpus 2 | 80,613,567 |
| Tigrinya | Tigrinya Web 2016 (tiWaC16) | 2,087,613 |
| Turkish | Turkish Web 2012 (trTenTen12) | 3,388,418,900 |
| Turkmen | Turkic web – Turkmen | 2,105,359 |
| Ukrainian | Ukrainian Web 2014 (ukTenTen14) | 2,194,447,594 |
| Urdu | Urdu Web (UrduWaC) | 53,269,273 |
| Uzbek | Turkic web – Uzbek | 18,720,334 |
| Vietnamese | Vietnamese Web (VietnameseWaC) | 106,464,835 |
| Welsh | Welsh Web 2013 (WelshWaC) | 12,458,397 |
| Yoruba | Yoruba Web 2015 (YorubaWaC15) | 2,816,965 |
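
To illustrate why keyword extraction needs a reference corpus, the comparison can be sketched as a ratio of normalized frequencies. The sketch below assumes a smoothed relative-frequency ratio in the style of the "simple maths" keyness score; this page does not state the exact formula, so the function names, the `smoothing` parameter, and all counts are hypothetical. Frequencies are normalized per million words because the corpus being analysed and the reference corpus usually differ greatly in size.

```python
def per_million(count: int, corpus_size: int) -> float:
    """Normalize a raw frequency to occurrences per million words."""
    return count * 1_000_000 / corpus_size


def keyness(focus_count: int, focus_size: int,
            ref_count: int, ref_size: int,
            smoothing: float = 1.0) -> float:
    """Smoothed ratio of normalized frequencies (assumed formula).

    The smoothing constant avoids division by zero when a word is
    absent from the reference corpus.
    """
    fpm_focus = per_million(focus_count, focus_size)
    fpm_ref = per_million(ref_count, ref_size)
    return (fpm_focus + smoothing) / (fpm_ref + smoothing)


# Hypothetical counts: a word occurs 500 times in a 1-million-word
# corpus under analysis and 1,000 times in a 100-million-word
# reference corpus.
score = keyness(500, 1_000_000, 1_000, 100_000_000)
print(round(score, 2))
```

A score well above 1 suggests the word is far more typical of the analysed corpus than of general language, which is why a large general corpus is a sensible default reference.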