The Serbo-Croatian web corpora are language corpora made up of texts collected from the Internet. Sketch Engine offers Bosnian, Croatian, and Serbian corpora collected from the web by Nikola Ljubešić and Filip Klubička in 2011 and 2013. The corpora were built using the following steps:
the data were obtained from the web using the Brno web corpus processing pipeline (SpiderLing, chared, jusText, onion);
the texts were lemmatised with CST's Lemmatiser (Jongejan and Dalianis, 2009);
morphosyntactic tags were assigned with HunPos (Halácsy et al., 2007);
all models were trained on the Croatian 90k-token annotated corpus SETimes.HR (Agić and Ljubešić, 2014).
Each corpus was annotated according to the MULTEXT-East Morphosyntactic Specifications, version 5, with small modifications for each language.
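MULTEXT-East morphosyntactic descriptions (MSDs) are positional tags: the first character gives the part of speech and each following character encodes one attribute. As a rough illustration, the sketch below decodes noun tags only. The function name `decode_noun_msd` and the attribute table are hypothetical and deliberately partial; they assume the common noun layout (type, gender, number, case) and do not reproduce the language-specific modifications mentioned above, so the official specification remains the reference.

```python
# Illustrative, partial attribute table for noun MSDs (an assumption made
# for this sketch, not a copy of the official specification).
NOUN_POSITIONS = [
    ("type", {"c": "common", "p": "proper"}),
    ("gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("number", {"s": "singular", "p": "plural"}),
    ("case", {
        "n": "nominative", "g": "genitive", "d": "dative",
        "a": "accusative", "v": "vocative", "l": "locative",
        "i": "instrumental",
    }),
]


def decode_noun_msd(msd):
    """Decode a noun MSD such as 'Ncfsg' into a feature dictionary."""
    if not msd or msd[0] != "N":
        raise ValueError("this sketch only handles noun tags starting with 'N'")
    features = {"category": "noun"}
    # Pair each attribute slot with the character at that position;
    # positions beyond the table (e.g. animacy) are ignored here.
    for (name, values), code in zip(NOUN_POSITIONS, msd[1:]):
        if code == "-":
            continue  # '-' marks an attribute that does not apply
        features[name] = values.get(code, "unknown(" + code + ")")
    return features


if __name__ == "__main__":
    print(decode_noun_msd("Ncfsg"))
```

Under these assumptions, a tag such as `Ncfsg` reads as a common feminine singular noun in the genitive. Other parts of speech follow the same positional principle with their own attribute tables.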