Reference corpora are used in keyword and term extraction. To identify keywords and terms correctly, the corpus they are extracted from (the focus corpus) has to be compared against a general corpus. The default reference corpora in Sketch Engine are listed below by language; "None" means that no reference corpus for terms is listed for that language. A different reference corpus can be selected in the Keywords and Terms settings.

| Language | Default reference corpus | Words | Reference corpus for terms | Words (terms corpus) |
|---|---|---|---|---|
| Afrikaans | Afrikaans Wikipedia corpus 2018 (afwiki) | 14,466,792 | None | |
| Albanian | OPUS2 Albanian | 46,304,346 | None | |
| Amharic | Amharic Web 2013-17 (amWaC17) | 25,975,846 | None | |
| Arabic | Arabic Web 2012 (arTenTen12, Stanford tagger) | 7,475,624,779 | None | |
| Azerbaijani | Turkic web – Azerbaijani | 94,267,206 | None | |
| Basque | Basque Web (BasqueWaC v2) | 99,719,584 | None | |
| Belarusian | Belarusian Web 2016 (beTenTen16) | 63,327,264 | None | |
| Bengali | Bengali Web (bnWaC) | 11,519,730 | None | |
| Bosnian | Bosnian Web (bsWaC 1.2) | 248,478,730 | None | |
| Bulgarian | Bulgarian Web 2012 (bgTenTen12, TreeTagger v2) | 705,156,683 | None | |
| Cantonese | Cantonese Web (CantoneseWaC) | 30,898,663 | None | |
| Catalan | Catalan Web 2014 (caTenTen14 v2) | 182,691,653 | None | |
| Croatian | Croatian Web (hrWaC 2.2, RFTagger) | 1,211,328,660 | Croatian Web (hrWaC 2.2, RFTagger) | 1,211,328,660 |
| Czech | Czech Web 2017 (csTenTen17) | 10,502,222,474 | Czech Web 2017 sample | 249,877,322 |
| Danish | Danish Web 2017 (daTenTen17) | 2,170,994,053 | Danish Web 2017 sample | 214,447,970 |
| Dutch | Dutch Web 2014 (nlTenTen14) | 2,253,777,579 | Dutch Web 2014 (nlTenTen14) | 2,253,777,579 |
| English | English Web 2013 (enTenTen13) | 19,685,733,337 | English Web 2013 sample | 204,976,089 |
| Estonian | Estonian Web 2013 (etTenTen13) | 260,559,829 | None | |
| Filipino | Filipino Web (FilipinoWaC) | 26,991,049 | None | |
| Finnish | Finnish Web 2014 (fiTenTen14, TreeTagger v2) | 1,404,100,049 | Finnish Web 2014 (fiTenTen14, TreeTagger v2) | 1,404,100,049 |
| Frisian | Western Frisian Web 2013 (FrisianWaC) | 3,116,119 | None | |
| Georgian | Georgian Web 2013 (kaWaC) | 50,713,604 | None | |
| German | German Web 2013 (deTenTen13) | 16,526,335,416 | German Web 2013 sample | 193,838,751 |
| Greek | Greek Web 2014 (elTenTen14) | 1,671,692,845 | None | |
| Gujarati | Gujarati Web (guWaC) | 17,960,095 | None | |
| Hausa (Boko) | Hausa Web 2015 (hausaWaC15) | 5,304,300 | None | |
| Hebrew | Hebrew Web 2014 (heTenTen14, no POS tagging) | 890,282,843 | None | |
| Hindi | Hindi Web (HindiWaC v. 4) | 107,960,109 | None | |
| Hungarian | Hungarian Web 2012 (huTenTen12) | 2,572,620,694 | None | |
| Icelandic | Icelandic texts [sample] | 5,436,035 | None | |
| Igbo | Igbo Web 2015 (IgboWaC15) | 331,042 | None | |
| Indonesian | Indonesian Web (IndonesianWaC) | 89,893,285 | None | |
| Irish | New Corpus for Ireland (NCI Irish) | 29,886,201 | None | |
| Italian | Italian Web 2016 (itTenTen16) | 4,989,729,171 | Italian Web 2016 sample | 201,204,942 |
| Japanese | Japanese Web 2011 sample (jaTenTen11, LUW) | 163,837,671 | Japanese Web 2011 sample (jaTenTen11, LUW) | 163,837,671 |
| Kannada | Kannada Web 2012 (knWaC12) | 11,056,526 | None | |
| Kazakh | Turkic web – Kazakh | 139,417,763 | None | |
| Khmer | Khmer Web 2018 (kmTenTen18) | 16,500,379 | None | |
| Korean | Korean Web 2018 (koTenTen18) | 1,668,851,720 | None | |
| Kyrgyz | Turkic web – Kyrgyz | 19,369,507 | None | |
| Lao | Lao Web 2018 (loTenTen18) | 15,862,991 | None | |
| Latin | LatinISE historical corpus v2.2 | 11,036,900 | None | |
| Latvian | Latvian Web 2014 (lvTenTen14) | 530,367,474 | None | |
| Lithuanian | Lithuanian Web 2014 (ltTenTen14) | 778,151,979 | None | |
| Macedonian | OPUS2 Macedonian | 40,348,792 | None | |
| Malay | Malaysian Web (MalaysianWaC) | 230,509,568 | None | |
| Malayalam | Malayalam Web (malayalamWaC) | 15,950,663 | None | |
| Maltese | Maltese MLRS Corpus | 110,714,844 | None | |
| Maori | Maori Web (MaoriWaC) | 6,952,801 | Maori Web (MaoriWaC) | 6,952,801 |
| Nepali | Nepali National Corpus | 13,440,835 | None | |
| Norwegian (Mixed) | Norwegian Web 2017 (noTenTen17, Bokmål) | 2,472,483,911 | Norwegian Web 2017 (noTenTen17, Bokmål) | 2,472,483,911 |
| Norwegian Bokmål | Norwegian Web 2017 (noTenTen17, Bokmål) | 2,472,483,911 | [DEV] Norwegian Web 2017 (noTenTen17, Bokmål, DEV SAMPLE) | 58,955,519 |
| Norwegian Nynorsk | Norwegian Web 2017 (noTenTen17, Nynorsk) | 174,830,652 | [DEV] Norwegian Web 2017 (noTenTen17, Nynorsk, DEV SAMPLE) | 58,743,828 |
| Oromo | Oromo Web 2016 (orWaC16) | 4,249,953 | None | |
| Persian | OPUS2 Persian | 4,425,133 | None | |
| Polish | Polish Web 2012 (plTenTen12, RFTagger) | 7,715,835,214 | Polish Web 2012 sample | 191,648,244 |
| Portuguese | Portuguese Web 2011 (ptTenTen11) | 3,896,392,719 | Portuguese Web 2011 sample | 202,548,549 |
| Romanian | Romanian Web 2016 (roTenTen16) | 2,640,496,763 | None | |
| Russian | Russian Web 2011 (ruTenTen11) | 14,553,856,113 | Russian Web 2011 sample (ruTenTen11) | 998,099,963 |
| Samoan | Samoan Web (SamoanWac1) | 3,115,385 | None | |
| Scottish Gaelic | Scottish Gaelic Wiki 2015 (gdWiki) | 980,026 | None | |
| Serbian | Serbian Web (srWaC 1.2 processed by Hunpos) | 477,724,164 | Serbian Web (srWaC 1.2 processed by Hunpos) | 477,724,164 |
| Serbian (Latin) | Serbian Web (srWaC 1.2 processed by RFTagger v1) | 441,888,202 | Serbian Web (srWaC 1.2 processed by RFTagger v1) | 441,888,202 |
| Setswana | Setswana/Tswana Web (SetswanaWaC v2) | 11,496,687 | None | |
| Slovak | Slovak Web 2011 (skTenTen11) | 540,112,634 | Slovak Web 2011 sample | 189,609,195 |
| Slovenian | Slovenian Web 2015 (slTenTen15, TreeTagger v2) | 829,544,337 | Slovenian Web 2015 sample | 195,792,821 |
| Somali | Somali Web 2016 (soWaC16) | 71,871,585 | None | |
| Spanish | Spanish Web 2011 (esTenTen11, Eu + Am) | 9,497,213,009 | Spanish Web 2011 sample | 212,142,794 |
| Swahili | Swahili Web 2014 (SwahiliWaC) | 17,882,483 | None | |
| Swedish | Swedish Web 2014 (svTenTen14) | 3,401,035,817 | [DEV] Swedish Web 2014 (svTenTen14) -- sample | 45,477,881 |
| Tagalog | Tagalog (Filipino) Web 2018 (tlTenTen18) | 151,164,040 | None | |
| Tajik | Tajik Web (TajikWaC) | 93,151,897 | None | |
| Tamil | Tamil Web 2015 (TamilWaC) | 26,750,515 | None | |
| Tatar | Tatar Mixed Corpus | 102,779,803 | None | |
| Telugu | Telugu Web (TeluguWaC) | 3,691,203 | None | |
| Thai | Thai Web (ThaiWaC) | 82,787,119 | None | |
| Tibetan | Tibetan Corpus 2 | 80,613,567 | None | |
| Tigrinya | Tigrinya Web 2016 (tiWaC16) | 2,087,613 | None | |
| Turkish | Turkish Web 2012 (trTenTen12) | 3,388,418,900 | None | |
| Turkmen | Turkic web – Turkmen | 2,105,359 | None | |
| Ukrainian | Ukrainian Web 2014 (ukTenTen14) | 2,194,447,594 | None | |
| Urdu | Urdu Web (UrduWaC) | 53,269,273 | None | |
| Uzbek | Turkic web – Uzbek | 18,720,334 | None | |
| Vietnamese | Vietnamese Web (VietnameseWaC) | 106,464,835 | None | |
| Welsh | Welsh Web 2013 (WelshWaC) | 12,458,397 | None | |
| Yoruba | Yoruba Web 2015 (YorubaWaC15) | 2,816,965 | None | |
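
To illustrate why keyword extraction needs a reference corpus at all, the following minimal Python sketch scores each word of a small corpus by comparing its frequency with the frequency of the same word in a general corpus. It assumes one common scoring scheme, a smoothed ratio of per-million frequencies (often called "simple maths"); whether a given Keywords and Terms configuration uses exactly this formula is an assumption here. The function names, the `smoothing` parameter and the toy sentences are hypothetical and invented for illustration only.

```python
from collections import Counter


def per_million(count: int, corpus_size: int) -> float:
    """Normalise a raw count to a frequency per million tokens."""
    return count * 1_000_000 / corpus_size


def keyness(focus_tokens, reference_tokens, smoothing=1.0):
    """Rank focus-corpus words by a smoothed ratio of normalised frequencies.

    Assumed scheme (not taken from the text above):
        score = (fpm_focus + smoothing) / (fpm_reference + smoothing)
    """
    focus_counts = Counter(focus_tokens)
    ref_counts = Counter(reference_tokens)
    focus_size = len(focus_tokens)
    ref_size = len(reference_tokens)

    scores = {}
    for word, count in focus_counts.items():
        fpm_focus = per_million(count, focus_size)
        # Counter returns 0 for words that never occur in the reference corpus.
        fpm_ref = per_million(ref_counts[word], ref_size)
        scores[word] = (fpm_focus + smoothing) / (fpm_ref + smoothing)

    # Highest score first: words most typical of the focus corpus.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data, invented for illustration only.
focus = "the corpus query returns the concordance of the corpus".split()
reference = "the cat sat on the mat and the dog saw the cat".split()

ranked = keyness(focus, reference)
for word, score in ranked[:3]:
    print(f"{word}\t{score:.1f}")
```

In this toy example a function word such as "the" is equally frequent in both corpora and therefore scores close to 1, while a word that is frequent in the focus corpus but absent from the reference corpus rises to the top. Per-million normalisation is what makes the comparison meaningful when the reference corpus is far larger than the focus corpus, as with the multi-billion-word corpora listed above.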