Reference corpora are used in term extraction. The corpus from which keywords are extracted is compared against a general reference corpus so that keywords and terms can be identified correctly. The table below lists the reference corpora used for term extraction by default. Users can change the reference corpus in the Keywords and Terms settings.
Reference corpus | Language | Tokens |
---|---|---|
German Web 2013 (deTenTen13, RFTagger v2) | German | 19,918,263,493 |
Russian Web 2011 sample (ruTenTen11) | Russian | 1,253,892,814 |
Polish Web 2012 (plTenTen12) | Polish | 9,387,142,186 |
European Spanish Web 2011 (eseuTenTen11) | Spanish | 2,343,829,757 |
Portuguese Web 2011 (ptTenTen11, Freeling v3, old) | Portuguese | 4,637,901,353 |
Japanese Web 2011 sample (jpTenTen11, LUW) | Japanese | 203,674,569 |
Korean Web 2012 sample (koTenTen12) | Korean | 43,113,814 |
Czech Web 2012 (czTenTen12 v8, sample) | Czech | 64,607,138 |
Slovak Web 2011 (skTenTen11) | Slovak | 656,067,998 |
Slovenian Web 2015 (slTenTen15) | Slovenian | 988,513,467 |
Chinese Web 2011 (zhTenTen11) | Chinese Simplified | 2,106,661,021 |
Chinese Web 2011 (zhTenTen11) | Chinese Traditional | 2,106,661,021 |
Dutch Web 2014 (nlTenTen14) | Dutch | 3,013,056,738 |
Italian Web 2010 sample (itTenTen) | Italian | 48,904,255 |
French Web 2012 (frTenTen12) | French | 11,444,973,582 |
British National Corpus (BNC) | English | 112,345,722 |
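
To illustrate why a reference corpus is needed, here is a minimal sketch of one common way to compare a source corpus with a reference corpus: a smoothed ratio of per-million frequencies. This page does not specify the scoring formula, so treat the formula as an assumption. The function names, the smoothing value, and all word counts are hypothetical and chosen only for illustration. The BNC token count is the only figure taken from the table above.

```python
# Illustrative sketch only: the scoring formula is an assumption,
# not a description of how the tool computes keywords.

BNC_TOKENS = 112_345_722  # size of the English reference corpus (from the table)


def per_million(count: int, corpus_tokens: int) -> float:
    """Normalize a raw count to occurrences per million tokens."""
    return count * 1_000_000 / corpus_tokens


def keyness(focus_count: int, focus_tokens: int,
            ref_count: int, ref_tokens: int,
            smoothing: float = 1.0) -> float:
    """Smoothed ratio of normalized frequencies (source vs. reference).

    Scores well above 1 mean the word is much more typical of the
    source corpus than of general language.
    """
    focus_fpm = per_million(focus_count, focus_tokens)
    ref_fpm = per_million(ref_count, ref_tokens)
    return (focus_fpm + smoothing) / (ref_fpm + smoothing)


# Hypothetical source corpus of 500,000 tokens with made-up counts:
# word -> (count in source corpus, count in reference corpus)
FOCUS_TOKENS = 500_000
candidates = {
    "corpus": (250, 1_000),
    "the": (30_000, 6_000_000),
}

ranked = sorted(
    ((word, keyness(fc, FOCUS_TOKENS, rc, BNC_TOKENS))
     for word, (fc, rc) in candidates.items()),
    key=lambda pair: pair[1],
    reverse=True,
)

for word, score in ranked:
    print(f"{word}: {score:.2f}")
```

In this sketch a domain word such as "corpus" ranks far above a function word such as "the", because "the" is about as frequent in the reference corpus as in the source corpus. A reference corpus that is too small or from a different language would distort these ratios, which is why a large general corpus per language is used by default.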