To cite Sketch Engine in academic publications, use the following papers. If you refer to Sketch Engine in general, cite one of the papers under General Reference.

General Reference

Adam Kilgarriff, Vít Baisa, Jan Bušta, Miloš Jakubíček, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý, Vít Suchomel. The Sketch Engine: ten years on. Lexicography, 1: 7-36, 2014.
@article{kilgarriff2014sketch,
  title={The Sketch Engine: ten years on},
  author={Kilgarriff, Adam and Baisa, Vít and Bušta, Jan and Jakubíček, Miloš and Kovář, Vojtěch and Michelfeit, Jan and Rychlý, Pavel and Suchomel, Vít},
  journal={Lexicography},
  year={2014},
  volume={1},
  pages={7--36},
  publisher={Springer}
}

Adam Kilgarriff, Pavel Rychlý, Pavel Smrž, David Tugwell. The Sketch Engine. Proceedings of the 11th EURALEX International Congress: 105-116, 2004.
@article{kilgarriff2004sketch,
  title={The Sketch Engine},
  author={Kilgarriff, Adam and Rychlý, Pavel and Smrž, Pavel and Tugwell, David},
  journal={Proceedings of the 11th EURALEX International Congress},
  year={2004},
  volume={},
  pages={105--116},
  publisher={Université de Bretagne-Sud, Faculté des lettres et des sciences humaines}
}

Please also mention the following web address: http://www.sketchengine.eu
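
As an illustration only, the web address could be recorded in a BibTeX file with an entry along the following lines. This is a sketch, not an official entry: the key sketchengineweb and the choice of the @misc type are assumptions, and the \url command assumes that the LaTeX url or hyperref package is loaded.
@misc{sketchengineweb,
  title={Sketch Engine},
  howpublished={\url{http://www.sketchengine.eu}}
}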

logDice statistic

A statistical measure used to compute word sketches since 2008.

Pavel Rychlý. A Lexicographer-Friendly Association Score. Proc. 2nd Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN, 2: 6-9, 2008.
@article{rychlý2008lexicographer,
  title={A Lexicographer-Friendly Association Score},
  author={Rychlý, Pavel},
  journal={Proc. 2nd Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN},
  year={2008},
  volume={2},
  pages={6--9},
  publisher={Masaryk University}
}

Evaluation of Word Sketches

Adam Kilgarriff, Vojtěch Kovář, Simon Krek, Irena Srdanović, Carole Tiberius. A quantitative evaluation of word sketches. Proceedings of the 14th EURALEX International Congress: 372-79, 2010.
@article{kilgarriff2010quantitative,
  title={A quantitative evaluation of word sketches},
  author={Kilgarriff, Adam and Kovář, Vojtěch and Krek, Simon and Srdanović, Irena and Tiberius, Carole},
  journal={Proceedings of the 14th EURALEX International Congress},
  year={2010},
  volume={},
  pages={372--79},
  publisher={Fryske Akademy-Afûk}
}

All statistics used in Sketch Engine

For a paper detailing the statistics used in Sketch Engine, visit the relevant page in the Research section.

Corpus query language (CQL)

Miloš Jakubíček, Adam Kilgarriff, Diana McCarthy, Pavel Rychlý. Fast Syntactic Searching in Very Large Corpora for Many Languages. PACLIC: 741-47, 2010.
@article{jakubíček2010fast,
  title={Fast Syntactic Searching in Very Large Corpora for Many Languages},
  author={Jakubíček, Miloš and Kilgarriff, Adam and McCarthy, Diana and Rychlý, Pavel},
  journal={PACLIC},
  year={2010},
  volume={},
  pages={741--47},
  publisher={Tohoku University}
}

Full bibliography of Sketch Engine

2023

  • M. Blahuš, M. Cukr, O. Herman, M. Jakubíček, V. Kovář, J. Kraus, M. Medveď, V. Ohlídalová. Rapid Ukrainian-English Dictionary Creation Using Post-Edited Corpus Data. Electronic lexicography in the 21st century. Proceedings of the eLex 2023 conference: 613-637, 2023.
    @article{blahuš2023rapid,
      title={Rapid Ukrainian-English Dictionary Creation Using Post-Edited Corpus Data},
      author={Blahuš, M. and Cukr, M. and Herman, O. and Jakubíček, M. and Kovář, V. and Kraus, J. and Medveď, M. and Ohlídalová, V.},
      journal={Electronic lexicography in the 21st century. Proceedings of the eLex 2023 conference},
      year={2023},
      pages={613--637},
      publisher={Brno, Czech Republic: Lexical Computing CZ s.r.o.}
    }
  • M. Blahuš, M. Jakubíček, M. Cukr, V. Kovář, V. Suchomel. Development of Evidence-Based Grammars for Terminology Extraction in OneClick Terms. Electronic lexicography in the 21st century. Proceedings of the eLex 2023 conference: 650-662, 2023.
    @article{blahuš2023development,
      title={Development of Evidence-Based Grammars for Terminology Extraction in OneClick Terms},
      author={Blahuš, M. and Jakubíček, M. and Cukr, M. and Kovář, V. and Suchomel, V.},
      journal={Electronic lexicography in the 21st century. Proceedings of the eLex 2023 conference},
      year={2023},
      pages={650--662},
      publisher={Brno, Czech Republic: Lexical Computing CZ s.r.o.}
    }
  • V. Suchomel, M. Jakubíček, O. Matuška. Web corpora for under-resourced languages. Corpus Linguistics (CL2023), 2023.
    @article{suchomel2023web,
      title={Web corpora for under-resourced languages},
      author={Suchomel, V. and Jakubíček, M. and Matuška, O.},
      journal={Corpus Linguistics (CL2023)},
      year={2023},
      pages={},
      publisher={Lancaster, United Kingdom: Lancaster University}
    }
  • M. Jakubíček, O. Matuška, M. Blahuš. Corpus-based Bilingual Terminology Extraction using One-Click Terms. Corpus Linguistics (CL2023), 2023.
    @article{jakubíček2023corpus,
      title={Corpus-based Bilingual Terminology Extraction using One-Click Terms},
      author={Jakubíček, M. and Matuška, O. and Blahuš, M.},
      journal={Corpus Linguistics (CL2023)},
      year={2023},
      pages={},
      publisher={Lancaster, United Kingdom: Lancaster University}
    }
  • Antonio San Martín, Catherine Trekker, Juan Carlos Díaz-Bautista. Extracting the Agent-Patient Relation from Corpus With Word Sketches. Proceedings of the 4th Conference on Language, Data and Knowledge: 666-675, 2023.
    @article{sanmartín2023extracting,
      title={Extracting the Agent-Patient Relation from Corpus With Word Sketches},
      author={San Martín, Antonio and Trekker, Catherine and Díaz-Bautista, Juan Carlos},
      journal={Proceedings of the 4th Conference on Language, Data and Knowledge},
      year={2023},
      pages={666--675},
      publisher={}
    }

2022

  • Miloš Jakubíček, Vojtěch Kovář, Michal Měchura, Adam Rambousek. Using NVH as a Backbone Format in the Lexonomy Dictionary Editor. Proceedings of the Sixteenth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2022: 55-61, 2022.
    @article{jakubíček2022using,
      title={Using NVH as a Backbone Format in the Lexonomy Dictionary Editor},
      author={Jakubíček, Miloš and Kovář, Vojtěch and Měchura, Michal and Rambousek, Adam},
      journal={Proceedings of the Sixteenth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2022},
      year={2022},
      pages={55--61},
      publisher={Tribun EU}
    }
  • Vladimír Benko. Aranea Go Middle East: Persicum. Proceedings of the Sixteenth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2022: 113-121, 2022.
    @article{benko2022aranea,
      title={Aranea Go Middle East: Persicum},
      author={Benko, Vladimír},
      journal={Proceedings of the Sixteenth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2022},
      year={2022},
      pages={113--121},
      publisher={Tribun EU}
    }
  • Matúš Kostka. Pipeline Effectiveness in the Sketch Engine. Proceedings of the Sixteenth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2022: 123-130, 2022.
    @article{kostka2022pipeline,
      title={Pipeline Effectiveness in the Sketch Engine},
      author={Kostka, Matúš},
      journal={Proceedings of the Sixteenth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2022},
      year={2022},
      pages={123--130},
      publisher={Tribun EU}
    }
  • Vít Suchomel, Jan Kraus. Semi-Manual Annotation of Topics and Genres in Web Corpora, The Cheap and Fast Way. Proceedings of the Sixteenth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2022: 141-148, 2022.
    @article{suchomel2022semi,
      title={Semi-Manual Annotation of Topics and Genres in Web Corpora, The Cheap and Fast Way},
      author={Suchomel, Vít and Kraus, Jan},
      journal={Proceedings of the Sixteenth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2022},
      year={2022},
      pages={141--148},
      publisher={Tribun EU}
    }
  • Antonio San Martín, Catherine Trekker, Pilar León-Araúz. Repérage automatisé de l’hyponymie dans des corpus spécialisés en français à l’aide de Sketch Engine. Terminology, 2022.
    @article{sanmartín2022repérage,
      title={Repérage automatisé de l’hyponymie dans des corpus spécialisés en français à l’aide de Sketch Engine},
      author={San Martín, Antonio and Trekker, Catherine and León-Araúz, Pilar},
      journal={Terminology},
      year={2022},
      publisher={John Benjamins}
    }
  • Virginijus Dadurkevicius, Andrius Utka. Estimating the Amount of Lithuanian Text Indexed by Global Search Engines. Baltic Journal of Modern Computing (BJMC): 326-336, 2022.
    @article{dadurkevicius2022estimating,
      title={Estimating the Amount of Lithuanian Text Indexed by Global Search Engines},
      author={Dadurkevicius, Virginijus and Utka, Andrius},
      journal={Baltic Journal of Modern Computing (BJMC)},
      year={2022},
      pages={326--336},
      publisher={University of Latvia}
    }

2021

  • Vít Suchomel. Genre Annotation of Web Corpora: Scheme and Issues. Proceedings of the Future Technologies Conference (FTC) 2020, Volume 1: 738-754, 2021.
    @article{suchomel2021genre,
      title={Genre Annotation of Web Corpora: Scheme and Issues},
      author={Suchomel, Vít},
      journal={Proceedings of the Future Technologies Conference (FTC) 2020, Volume 1},
      year={2021},
      pages={738--754},
      publisher={Springer International Publishing}
    }
  • Miloš Jakubíček, Vojtěch Kovář, Pavel Rychlý. Million-Click Dictionary: Tools and Methods for Automatic Dictionary Drafting and Post-Editing. Book of Abstracts of the 19th EURALEX International Congress: 65-67, 2021.
    @article{jakubíček2021million,
      title={Million-Click Dictionary: Tools and Methods for Automatic Dictionary Drafting and Post-Editing},
      author={Jakubíček, Miloš and Kovář, Vojtěch and Rychlý, Pavel},
      journal={Book of Abstracts of the 19th EURALEX International Congress},
      year={2021},
      pages={65--67},
      publisher={SynMorPhoSe Lab, Democritus University of Thrace}
    }
  • Vít Suchomel, Jan Kraus. Website Properties in Relation to the Quality of Text Extracted for Web Corpora. Proceedings of the Fifteenth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2021: 167-175, 2021.
    @article{suchomel2021website,
      title={Website Properties in Relation to the Quality of Text Extracted for Web Corpora},
      author={Suchomel, Vít and Kraus, Jan},
      journal={Proceedings of the Fifteenth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2021},
      year={2021},
      pages={167--175},
      publisher={Tribun EU}
    }
  • Ondřej Herman. Precomputed Word Embeddings for 15+ Languages. Proceedings of the Fifteenth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2021: 41-46, 2021.
    @article{herman2021precomputed,
      title={Precomputed Word Embeddings for 15+ Languages},
      author={Herman, Ondřej},
      journal={Proceedings of the Fifteenth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2021},
      year={2021},
      pages={41--46},
      publisher={Tribun EU}
    }
  • A. Rambousek, M. Jakubíček, Iztok Kosem. New developments in Lexonomy. Electronic lexicography in the 21st century. Proceedings of the eLex 2021 conference: 455-462, 2021.
    @article{rambousek2021new,
      title={New developments in Lexonomy},
      author={Rambousek, A. and Jakubíček, M. and Kosem, Iztok},
      journal={Electronic lexicography in the 21st century. Proceedings of the eLex 2021 conference},
      year={2021},
      pages={455--462},
      publisher={Brno, Czech Republic: Lexical Computing CZ s.r.o.}
    }
  • Miloš Jakubíček, Emma Romani, Pavel Rychlý, Ondřej Herman. Development of HAMOD: a High Agreement Multi-lingual Outlier Detection dataset. Proceedings of the Fifteenth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2021: 177-183, 2021.
    @article{jakubíček2021development,
      title={Development of HAMOD: a High Agreement Multi-lingual Outlier Detection dataset},
      author={Jakubíček, Miloš and Romani, Emma and Rychlý, Pavel and Herman, Ondřej},
      journal={Proceedings of the Fifteenth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2021},
      year={2021},
      pages={177--183},
      publisher={Tribun EU}
    }
  • Antonio San Martín, Catherine Trekker. Adapting Word Sketches for Specialized Knowledge Extraction. Proceedings of the 14th International Conference of the Asian Association for Lexicography (ASIALEX 2021): 64-87, 2021.
    @article{sanmartín2021adapting,
      title={Adapting Word Sketches for Specialized Knowledge Extraction},
      author={San Martín, Antonio and Trekker, Catherine},
      journal={Proceedings of the 14th International Conference of the Asian Association for Lexicography (ASIALEX 2021)},
      year={2021},
      pages={64--87},
      publisher={Jakarta, Indonesia: Asialex}
    }

2020

  • Miloš Jakubíček, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel. Current Challenges in Web Corpus Building. Proceedings of the 12th Web as Corpus Workshop: 1-4, 2020.
    @article{jakubíček2020current,
      title={Current Challenges in Web Corpus Building},
      author={Jakubíček, Miloš and Kovář, Vojtěch and Rychlý, Pavel and Suchomel, Vít},
      journal={Proceedings of the 12th Web as Corpus Workshop},
      year={2020},
      pages={1--4},
      publisher={Marseille, France: European Language Resources Association}
    }
  • Antonio San Martín, Catherine Trekker, Pilar León-Araúz. Extraction of Hyponymic Relations in French with Knowledge-Pattern-Based Word Sketches. Proceedings of The 12th Language Resources and Evaluation Conference: 5955-5963, 2020.
    @article{martín2020extraction,
      title={Extraction of Hyponymic Relations in French with Knowledge-Pattern-Based Word Sketches},
      author={San Martín, Antonio and Trekker, Catherine and León-Araúz, Pilar},
      journal={Proceedings of The 12th Language Resources and Evaluation Conference},
      year={2020},
      pages={5955--5963}
    }

2019

  • V. Baisa, M. Blahuš, M. Cukr, O. Herman, M. Jakubíček, V. Kovář, M. Medveď, M. Měchura, P. Rychlý, V. Suchomel. Automating dictionary production: a Tagalog-English-Korean dictionary from scratch. Proceedings of the 6th Biennial Conference on Electronic Lexicography, 2019.
    @article{baisa2019automating,
      title={Automating dictionary production: a Tagalog-English-Korean dictionary from scratch},
      author={Baisa, V. and Blahuš, M. and Cukr, M. and Herman, O. and Jakubíček, M. and Kovář, V. and Medveď, M. and Měchura, M. and Rychlý, P. and Suchomel, V.},
      journal={Proceedings of the 6th Biennial Conference on Electronic Lexicography},
      year={2019},
      pages={},
      publisher={Brno, Czech Republic: Lexical Computing CZ s.r.o.}
    }
  • Kristina Koppel, Jelena Kallas, Maria Khokhlová, Vít Suchomel, Vít Baisa, Jan Michelfeit. SkELL Corpora as a Part of the Language Portal Sonaveeb: Problems and Perspectives. Proceedings of the 6th Biennial Conference on Electronic Lexicography, 2019.
    @article{koppel2019skell,
      title={SkELL Corpora as a Part of the Language Portal Sonaveeb: Problems and Perspectives},
      author={Koppel, Kristina and Kallas, Jelena and Khokhlová, Maria and Suchomel, Vít and Baisa, Vít and Michelfeit, Jan},
      journal={Proceedings of the 6th Biennial Conference on Electronic Lexicography},
      year={2019},
      pages={},
      publisher={Brno, Czech Republic: Lexical Computing CZ s.r.o.}
    }
  • Ondřej Herman, Miloš Jakubíček, Vojtěch Kovář, Pavel Rychlý. Word Sense Induction Using Word Sketches. Proceedings of the 7th International Conference on Statistical Language and Speech Processing: 83-91, 2019.
    @article{herman2019word,
      title={Word Sense Induction Using Word Sketches},
      author={Herman, Ondřej and Jakubíček, Miloš and Kovář, Vojtěch and Rychlý, Pavel},
      journal={Proceedings of the 7th International Conference on Statistical Language and Speech Processing},
      year={2019},
      pages={83--91},
      publisher={Cham, Switzerland: Springer}
    }
  • Miloš Jakubíček, Pavel Rychlý. A Distributional Multi-word Thesaurus in Sketch Engine. Proceedings of the Thirteenth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2019: 143-147, 2019.
    @article{jakubíček2019distributional,
      title={A Distributional Multi-word Thesaurus in Sketch Engine},
      author={Jakubíček, Miloš and Rychlý, Pavel},
      journal={Proceedings of the Thirteenth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2019},
      year={2019},
      pages={143--147},
      publisher={Tribun EU}
    }

2018

  • M. Jakubíček, M. Měchura, V. Kovář, P. Rychlý. Practical Post-Editing Lexicography with Lexonomy and Sketch Engine. XVIII EURALEX International Congress: Lexicography in Global Contexts, 2018.
    @article{jakubíček2018practical,
      title={Practical Post-Editing Lexicography with Lexonomy and Sketch Engine},
      author={Jakubíček, M. and Měchura, M. and Kovář, V. and Rychlý, P.},
      journal={XVIII EURALEX International Congress: Lexicography in Global Contexts},
      year={2018},
      pages={},
      publisher={Ljubljana University Press, Faculty of Arts}
    }
    (CC BY SA 4.0 The XVIII EURALEX International Congress: Lexicography in Global Contexts Book of Abstracts)
  • Vít Suchomel. csTenTen17, a Recent Czech Web Corpus. Twelfth Workshop on Recent Advances in Slavonic Natural Language Processing: 111-123, 2018.
    @article{suchomel2018cstenten17,
      title={csTenTen17, a Recent Czech Web Corpus},
      author={Suchomel, Vít},
      journal={Twelfth Workshop on Recent Advances in Slavonic Natural Language Processing},
      year={2018},
      pages={111--123},
      publisher={Tribun EU}
    }
  • Pavel Rychlý, Radoslav Rábara, Ondřej Herman. Distributed Corpus Search. 6th Workshop on the Challenges in the Management of Large Corpora (LREC 2018 Workshop): 10-13, 2018.
    @article{rychlý2018distributed,
      title={Distributed Corpus Search},
      author={Rychlý, Pavel and Rábara, Radoslav and Herman, Ondřej},
      journal={6th Workshop on the Challenges in the Management of Large Corpora (LREC 2018 Workshop)},
      year={2018},
      pages={10--13},
      publisher={Japan: European Language Resources Association}
    }
  • Gezahegn Tsegaye Lemma, Pavel Rychlý. An Update of the Manually Annotated Amharic Corpus. Proceedings of the Twelfth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2018: 124-128, 2018.
    @article{lemma2018update,
      title={An Update of the Manually Annotated Amharic Corpus},
      author={Lemma, Gezahegn Tsegaye and Rychlý, Pavel},
      journal={Proceedings of the Twelfth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2018},
      year={2018},
      pages={124--128},
      publisher={Brno: Tribun EU}
    }
  • Vít Suchomel. Building Large Corpora From The Web (presentation slides). Presented at Web Corpora as a Language Training Tool organised by Faculty of Arts of Comenius University in Bratislava, Linguistic Institute of the Slovak Academy of Sciences on November 23, 2018.
  • Pilar León-Araúz, Antonio San Martín, Arianne Reimerink. The EcoLexicon English Corpus as an Open Corpus in Sketch Engine. XVIII EURALEX International Congress: Lexicography in Global Contexts: 893-901, 2018.
    @article{león-araúz2018ecolexicon,
      title={The EcoLexicon English Corpus as an Open Corpus in Sketch Engine},
      author={León-Araúz, Pilar and San Martín, Antonio and Reimerink, Arianne},
      journal={XVIII EURALEX International Congress: Lexicography in Global Contexts},
      year={2018},
      pages={893--901},
      publisher={Ljubljana University Press, Faculty of Arts}
    }
  • Pilar León-Araúz, Antonio San Martín. The EcoLexicon Semantic Sketch Grammar: from Knowledge Patterns to Word Sketches. Proceedings of the LREC 2018 Workshop “Globalex 2018 – Lexicography & WordNets”: 94-99, 2018.
    @article{león-araúz2018semantic,
      title={The EcoLexicon Semantic Sketch Grammar: from Knowledge Patterns to Word Sketches},
      author={León-Araúz, Pilar and San Martín, Antonio},
      journal={Proceedings of the LREC 2018 Workshop “Globalex 2018 – Lexicography \& WordNets”},
      year={2018},
      pages={94--99},
      publisher={Miyazaki: Globalex}
    }

2017

  • Miloš Jakubíček. The advent of post-editing lexicography. Kernerman Dictionary News, 25: 14-15, 2017.
    @article{jakubíček2017advent,
      title={The advent of post-editing lexicography},
      author={Jakubíček, Miloš},
      journal={Kernerman Dictionary News},
      year={2017},
      volume={25},
      pages={14--15},
      publisher={Kernerman Dictionary}
    }
  • Jelena Kallas, Vít Suchomel, Maria Khokhlova. Automated Identification of Domain Preferences of Collocations. Electronic Lexicography in the 21st Century. Proceedings of Elex 2017 Conference, 5: 30-320, 2017.
    @article{kallas2017automated,
      title={Automated Identification of Domain Preferences of Collocations},
      author={Kallas, Jelena and Suchomel, Vít and Khokhlova, Maria},
      journal={Electronic Lexicography in the 21st Century. Proceedings of Elex 2017 Conference},
      year={2017},
      volume={5},
      pages={30--320},
      publisher={Lexical Computing CZ s.r.o.}
    }
  • R. Rábara, P. Rychlý, O. Herman, M. Jakubíček. Accelerating Corpus Search Using Multiple Cores. Challenges in the Management of Large Corpora and Big Data and Natural Language Processing (CMLC-5+ BigNLP) 2017 including the papers from the Web-as-Corpus (WAC-XI), 30: 30-34, 2017.
    @article{rábara2017accelerating,
      title={Accelerating Corpus Search Using Multiple Cores},
      author={Rábara, R. and Rychlý, P. and Herman, O. and Jakubíček, M.},
      journal={Challenges in the Management of Large Corpora and Big Data and Natural Language Processing (CMLC-5+ BigNLP) 2017 including the papers from the Web-as-Corpus (WAC-XI), 30},
      year={2017},
      volume={},
      pages={30--34},
      publisher={Institut für Deutsche Sprache}
    }
  • J. Bušta, O. Herman, M. Jakubíček, S. Krek, B. Novak. JSI Newsfeed Corpus. The 9th International Corpus Linguistics Conference, 2017.
    @article{bušta2017jsi,
      title={JSI Newsfeed Corpus},
      author={Bušta, J. and Herman, O. and Jakubíček, M. and Krek, S. and Novak, B.},
      journal={The 9th International Corpus Linguistics Conference},
      year={2017},
      publisher={University of Birmingham}
    }
  • V. Baisa, J. Michelfeit, O. Matuška. Simplifying terminology extraction: OneClick Terms. The 9th International Corpus Linguistics Conference, 2017.
    @article{baisa2017simplifying,
      title={Simplifying terminology extraction: OneClick Terms},
      author={Baisa, V. and Michelfeit, J. and Matuška, O.},
      journal={The 9th International Corpus Linguistics Conference},
      year={2017},
      publisher={University of Birmingham}
    }
  • A. O. Anić, S. K. Žuvela. The conceptualization of music in semantic frames based on word sketches. The 9th International Corpus Linguistics Conference, 2017.
    @article{anić2017conceptualization,
      title={The conceptualization of music in semantic frames based on word sketches},
      author={Anić, A. O. and Žuvela, S. K.},
      journal={The 9th International Corpus Linguistics Conference},
      year={2017},
      publisher={University of Birmingham}
    }
  • M. Kunilovskaya, M. Koviazina. Sketch Engine: A Toolbox for Linguistic Discovery. Journal of Linguistics/Jazykovedný časopis, 68(3): 503-507, 2017.
    @article{kunilovskaya2017sketch,
      title={Sketch Engine: A Toolbox for Linguistic Discovery},
      author={Kunilovskaya, M. and Koviazina, M.},
      journal={Journal of Linguistics/Jazykovedný časopis},
      year={2017},
      volume={68},
      number={3},
      pages={503--507}
    }
    (CC BY-NC-ND 4.0)

  • Walking the tightrope between linguistics and language engineering
    • Miloš Jakubíček, Vít Baisa, Jan Bušta, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý and Vít Suchomel

2016

  • R. Evans, A. Gelbukh, G. Grefenstette, P. Hanks, M. Jakubíček, D. McCarthy, M. Palmer, T. Pedersen, M. Rundell, P. Rychlý, D. Tugwell, S. Sharoff. Adam Kilgarriff’s Legacy to Computational Linguistics and Beyond. International Conference on Intelligent Text Processing and Computational Linguistics (April 2016): 3-25, 2016.
    @article{evans2016adam,
      title={Adam Kilgarriff’s Legacy to Computational Linguistics and Beyond},
      author={Evans, R. and Gelbukh, A. and Grefenstette, G. and Hanks, P. and Jakubíček, M. and McCarthy, D. and Palmer, M. and Pedersen, T. and Rundell, M. and Rychlý, P. and Tugwell, D. and Sharoff, S.},
      journal={International Conference on Intelligent Text Processing and Computational Linguistics (April 2016)},
      year={2016},
      volume={},
      pages={3--25},
      publisher={Springer}
    }
  • An Exploratory Analysis of ScienceBlog
    • Caterina Allais
    • In L’Analisi Linguistica e Letteraria, Facoltà di Scienze Linguistiche e Letterature straniere Università Cattolica del Sacro Cuore, Milano, December 2016, pp. 161–170
  • Analyse de trois systèmes de gestion de corpus pour l’enseignement-apprentissage des langues étrangères (Analysis of three corpus management systems for foreign language teaching and learning; in French)
    • Eva Schaeffer-Lacroix
    • In Alsic [online], Vol. 18, No. 1, 2015, published online 20 December 2015, accessed 18 January 2016.
  • Annotated Amharic Corpora
    • Pavel Rychlý, Vít Suchomel
    • In Petr Sojka, Aleš Horák, Ivan Kopeček, Karel Pala. Text, Speech, and Dialogue 19th International Conference, TSD 2016 Brno, Czech Republic, September 12–16, 2016 Proceedings, pp. 295-302, DOI 10.1007/978-3-319-45510-5_34
  • European Union Language Resources in Sketch Engine
    • Vít Baisa, Jan Michelfeit, Marek Medveď and Miloš Jakubíček
    • In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pp. 2799–2803, Slovenia, May 2016.
  • Finding Definitions in Large Corpora with Sketch Engine
    • Vojtěch Kovář, Monika Močiariková, Pavel Rychlý
    • In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). Portorož, Slovenia: European Language Resources Association (ELRA), 2016. pp. 391–394
  • Tanja Samardzic, Yves Scherrer, Elvira Glaser. Archimob-a corpus of spoken Swiss German. Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), 10: 4061-4066, 2016.
    @article{samardzic2016archimob,
      title={Archimob-a corpus of spoken Swiss German},
      author={Samardzic, Tanja and Scherrer, Yves and Glaser, Elvira},
      journal={Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)},
      year={2016},
      volume={10},
      pages={4061--4066},
      publisher={European Language Resources Association (ELRA)}
    }
  • RuSkELL: Online Language Learning Tool for Russian Language
    • Valentina Apresjan, Vít Baisa, Olga Buivulova, Olga Kultepina
    • In Tinatin Margalitadze, George Meladze. Proceedings of the XVII EURALEX International congress. Tbilisi: Ivane Javakhishvili Tbilisi State University, 2016. pp. 292–299

2015

  • Turkic Language Support in Sketch Engine
    • Vít Baisa and Vít Suchomel
    • In Proceedings of the international conference “Turkic Languages processing: TurkLang 2015”, Russia, September 2015, pp. 214–223
  • Automatic generation of the Estonian Collocations Dictionary database (presentation)
    • Jelena Kallas, Adam Kilgarriff, Kristina Koppel, Elgar Kudritski, Margit Langemets, Jan Michelfeit, Maria Tuulik, Ülle Viks
    • In Kosem, I., Jakubíček, M., Kallas, J., Krek, S. (eds.) Electronic lexicography in the 21st century: linking lexical data in the digital age. Proceedings of the eLex 2015 conference, August 2015, Herstmonceux Castle, UK., pp. 1–20.
  • Interactive visualization methods for Sketch Engine
    • Lucia Kocincová, Miloš Jakubíček, Vojtěch Kovář and Vít Baisa
    • In Gintaré Grigonyté, Simon Clematide, Andrius Utka, Martin Volk (eds.). Proceedings of the Workshop on Innovative Corpus Query and Visualization Tools at NODALIDA 2015. Vilnius, Lithuania: Linköping University Electronic Press, Linköpings universitet, 2015, pp. 17–22
  • Learning Chinese with the Sketch Engine
    • Adam Kilgarriff, Nicole Keng, Simon Smith and Wei Bo
    • In Zou, B., Hoey, M. & Smith, S. (eds.). Corpus Linguistics in Chinese Contexts. Basingstoke: Palgrave, 2015

2014

  • Effective Corpus Virtualization
    • Miloš Jakubíček, Adam Kilgarriff and Pavel Rychlý (2014)
    • In Challenges in the Management of Large Corpora (CMLC-2), May 2014
  • The Sketch Engine: ten years on
    • Adam Kilgarriff, Vít Baisa, Jan Bušta, Miloš Jakubíček, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý and Vít Suchomel (2014)
    • In Lexicography: Journal of ASIALEX, volume 1, issue 1, pp. 7–36
  • arTenTen: Arabic Corpus and Word Sketches
    • Tressy Arts, Yonatan Belinkov, Nizar Habash, Adam Kilgarriff and Vít Suchomel (2014)
    • In Journal of King Saud University – Computer and Information Sciences, volume 26, issue 4, December 2014, pp. 381–395
  • Hindi Word Sketches
    • Anil Krishna Eragani, Varun Kuchibhotla, Dipti Sharma, Siva Reddy and Adam Kilgarriff (2014)
    • In Proceedings of the Conference on Natural Language Processing (ICON-11), Goa, India, December 2014, pp. 11818–125
  • Web As Corpus: Theory and Practice
    • Maristella Gatto
    • A&C Black, October 2014
  • Text Tokenisation Using unitok
    • Vít Suchomel, Jan Michelfeit and Jan Pomikálek (2014)
    • In Proceedings of the Eighth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2014, Czech Republic, December 2014, pp. 71–75
  • Bilingual Word Sketches: the translate Button
    • Vít Baisa, Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář and Pavel Rychlý
    • In Proceedings of the 16th EURALEX International Congress. 15–19 July 2014, Bolzano, Italy, pp. 505–513

2013

  • The TenTen Corpus Family
    • Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý and Vít Suchomel (2013)
    • In Proceedings of the 7th International Corpus Linguistics Conference CL 2013, the United Kingdom, July 2013, pp. 125–127
  • Web Spam
    • Adam Kilgarriff and Vít Suchomel (2013)
    • In Proceedings of the 8th Web as Corpus Workshop (WAC-8), the United Kingdom, July 2013, pp. 46–52
  • arTenTen: a new, vast corpus for Arabic
    • Yonatan Belinkov, Nizar Habash, Adam Kilgarriff, Noam Ordan, Ryan Roth and Vít Suchomel (2013)
    • In Proceedings of WACL’2 Second Workshop on Arabic Corpus Linguistics, the United Kingdom, July 2013, pp. 20
  • Intrinsic Methods for Comparison of Corpora
    • Vít Baisa and Vít Suchomel (2013)
    • In Proceedings of the Seventh Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2013, Czech Republic, December 2013, pp. 51–58
  • 百億語のコーパスを用いた日本語の語彙・文法情報のプロファイリング
    • (Japanese Language Lexical and Grammatical Profiling Using the Web Corpus JpTenTen)
    • Irena Srdanović, Vít Suchomel, Toshinobu Ogiso and Adam Kilgarriff (2013)
    • 『「第3回コーパス日本語学ワークショップ」予稿集』国立国語研究所 言語資源研究系・コーパス開発センター (In Proceedings of the 3rd Japanese Corpus Linguistics Workshop, Department of Corpus Studies, Center for Corpus Development, NINJAL), pp. 229–238

2012

  • Word Sense Induction for Novel Sense Detection
    • Jey Han Lau, Paul Cook, Diana McCarthy, David Newman and Timothy Baldwin (2012)
    • In 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2012), France, April 2012, pp. 591–601
  • Getting to know your corpus
    • Adam Kilgarriff (2012)
    • In Proceedings of The 15th International Conference on Text, Speech and Dialogue (TSD), Petr Sojka, Aleš Horák, Ivan Kopeček and Karel Pala (eds.), Czech Republic, September 2012, pp. 3–15
  • Detecting Spam in Web Corpora
    • Vít Baisa and Vít Suchomel (2012)
    • In Proceedings of the Sixth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2012, Czech Republic, December 2012, pp. 69–76
  • Recent Czech Web Corpora
    • Vít Suchomel (2012)
    • In Proceedings of the Sixth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2012, Czech Republic, December 2012, pp. 77–83
  • Finding Multiwords of More Than Two Words
    • Adam Kilgarriff, Pavel Rychlý, Vojtěch Kovář and Vít Baisa (2012)
    • In Proceedings of the 15th EURALEX International Congress, Norway, August 2012, pp. 693–700
  • Word Sketches for Turkish
    • Bharat Ram Ambati, Siva Reddy and Adam Kilgarriff (2012)
    • In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), Turkey, May 2012, pp. 2945–2950
  • Learner corpora and second language acquisition
    • Meng Huat Chau (2012)
    • In Corpus Applications in Applied Linguistics, K. Hyland, M. H. Chau & M. Handford (eds.), London: Continuum, 2012, pp. 191–207
  • Setting up for corpus lexicography
    • Adam Kilgarriff, Jan Pomikálek, Miloš Jakubíček and Pete Whitelock (2012)
    • In Proceedings of the 15th EURALEX International Congress, Norway, August 2012, pp. 31–55
  • Corpus Tools for Lexicographers
    • Adam Kilgarriff and Iztok Kosem (2012)
    • In Electronic Lexicography, Sylviane Granger and Magali Paquot (eds.), Oxford University Press, October 2012, pp. 31–55
  • Vietnamese Word Sketches
    • Adam Kilgarriff and Phuong Le-Hong (2012)
    • In Workshop on Vietnamese Language and Speech Processing (IEEE-RIVF 9), Vietnam, February 2012, pp. 1–4
  • Building A Thesaurus Using LDA-Frames
    • Jiří Materna (2012)
    • In Proceedings of the Sixth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2012, Czech Republic, December 2012, pp. 97–103

2011

  • Comparable Corpora BootCaT
    • Adam Kilgarriff, Avinesh PVS and Jan Pomikálek (2011)
    • In Proceedings of eLEX 2011, Slovenia, November 2011, pp. 122–128
  • GDEX for Slovene
    • Iztok Kosem, Miloš Husák and Diana McCarthy (2011)
    • In Proceedings of eLEX 2011, Slovenia, November 2011, pp. 151–159
  • Large Web Corpora for Indian Languages
    • Adam Kilgarriff and Girish Duvuru (2011)
    • In Proceedings of International Conference on Information Systems for Indian Languages (ICISIL), India, 2011, pp. 312–313
  • Polish Word Sketches
    • Adam Radziszewski, Adam Kilgarriff and Robert Lew (2011)
    • In Proceedings of the 5th Language & Technology Conference (LTC), Poland, November 2011, pp. 237–242

2010

  • Helping Our Own
    • Robert Dale and Adam Kilgarriff (2010)
    • In International Natural Language Generation Conference, Dublin, Ireland
  • Studying Word Sketches for Russian
    • Maria Khokhlova and Victor Zakharov (2010)
    • In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10), Malta, May 2010, pp. 3491–3494
  • A Case Study in Word Sketches – Czech Verb vidět “see”
    • Karel Pala and Pavel Rychlý (2010)
    • In A Way with Words: Recent Advances in Lexical Theory and Analysis. A Festschrift for Patrick Hanks, Gilles-Maurice de Schryver (ed.), Menha Publishers, 2010, pp. 187–198
  • Google The Verb
    • Adam Kilgarriff (2010)
    • In Language Resources and Evaluation Journal, 44 (3), pp. 281–290
  • Tickbox Lexicography
    • Adam Kilgarriff, Vojtěch Kovář and Pavel Rychlý (2010)
    • In eLexicography in the 21st century: New challenges, new applications, Presses universitaires de Louvain, Brussels, 2010, pp. 411–418
  • Semi-Automatic Dictionary Drafting
    • Adam Kilgarriff and Pavel Rychlý (2010)
    • In A Way with Words: Recent Advances in Lexical Theory and Analysis, Uganda: Menha Publishers Ltd., 2010, pp. 299–312
  • Corpora by Web Services
    • Adam Kilgarriff (2010)
    • In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10), Malta, May 2010
  • A corpus factory for many languages
    • Adam Kilgarriff, Siva Reddy, Jan Pomikálek and Avinesh PVS (2010)
    • LREC workshop on Web Services and Processing Pipelines, Malta, May 2010
  • The RoWaC Corpus and Romanian Word Sketches
    • Monica Macoveiciuc and Adam Kilgarriff (2010)
    • In Multilinguality and Interoperability in Language Processing with Emphasis on Romanian, Dan Tufis and Corina Forascu (eds.), Romanian Academy, pp. 151–168

2009

  • Scaling to Billion-plus Word Corpora
    • Jan Pomikálek, Pavel Rychlý and Adam Kilgarriff
    • In Advances in Computational Linguistics, Instituto Politécnico Nacional, volume 41, Mexico, 2009, pp. 3–13
  • Simple maths for keywords
    • Adam Kilgarriff
    • In Proceedings of Corpus Linguistics Conference CL2009, Mahlberg, M., González-Díaz, V. & Smith, C. (eds.), University of Liverpool, UK, July 2009.
  • Putting the corpus into the dictionary (first published in 2005 as Linking Dictionary and Corpus)
    • Adam Kilgarriff (2009)
    • In V.B.Y. Ooi, A. Pakir, I.S. Talib and P.K. Tan (eds.), Perspectives in Lexicography: Asia and Beyond, Israel: K Dictionaries, 2009, pp. 239–247
  • Extracting distant collocations of adverbs and modality forms using web corpus and query system
    • Irena Srdanović, Bor Hodošček, Andrej Bekeš and Kikuko Nishina (2009)
    • 「ウェブコーパスと検索システムを利用した推量副詞とモダリティ形式の遠隔共起抽出と日本語教育への応用」『自然言語処理』(Extracting distant collocations of adverbs and modality forms using web corpus and query system, Journal of Natural Language Processing), 16/4, pp. 29–46
  • Towards creation of lexical syllabus based on corpora – on suppositional adverbs and clause-final modality collocations
    • Irena Srdanović, Andrej Bekeš and Kikuko Nishina (2009)
    • 「コーパスに基づいた語彙シラバス作成に向けて―推量的副詞と文末モダリティの共起を中心にして―」『日本語教育』142号 (Towards creation of lexical syllabus based on corpora – on suppositional adverbs and clause-final modality collocations, Journal of Japanese Language Education, 142, pp. 69–79)
  • Czech Word Sketch Relations with Full Syntax Parser
    • Aleš Horák, Pavel Rychlý and Adam Kilgarriff
    • In After Half a Century of Slavonic Natural Language Processing. Czech Republic, Brno: Masaryk University, 2009, pp. 101–112. ISBN 978-80-7399-815-8.
  • Classifying corpora based on adverbs distribution
    • Irena Srdanović, Bor Hodošček, Andrej Bekeš and Kikuko Nishina (2009)
    • In International Quantitative Linguistics Conference (Qualico), Austria, September 2009

2008

  • A Lexicographer-Friendly Association Score
    • Pavel Rychlý (2008)
    • In Second Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2008. Brno, Masaryk University, 2008, pp. 6–9. ISBN 978-80-210-4741-9.
  • Cleaneval: a Competition for Cleaning Web Pages
    • Marco Baroni, Francis Chantree, Adam Kilgarriff, and Serge Sharoff
    • In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC’08). Marrakech, Morocco, May 2008, pp. 638–643
  • A web corpus and word sketches for Japanese
    • Irena Srdanović, Tomaž Erjavec and Adam Kilgarriff (2008)
    • A web corpus and word-sketches for Japanese『自然言語処理』(Journal of Natural Language Processing) 15/2, 137–159. (reprinted in Information and Media Technologies 3/3, 2008, pp. 529–551)

2007

  • Manatee/bonito – a modular corpus manager
    • Pavel Rychlý (2007)
    • In First Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2007. Brno: Masaryk University, 2007, pp. 65–70. ISBN 978-80-210-4471-5

2006

  • Slovene Word Sketches
    • Simon Krek and Adam Kilgarriff (2006)
    • In Proceedings of the 5th Slovenian/First International Language Technologies Conference, Slovenia, October 2006

2005 and earlier

  • Chinese Word Sketches
    • Adam Kilgarriff, Chu-Ren Huang, Pavel Rychlý, Simon Smith and David Tugwell (2005)
    • In Proc. Asialex, Singapore, June 2005
  • The Sketch Engine
    • Adam Kilgarriff, Pavel Rychlý, Pavel Smrž, and David Tugwell (2004)
    • In Proceedings of the 11th EURALEX International Congress. France, July 2004, pp. 105–116 (reprinted in Lexicology: Critical concepts in Linguistics P. W. Hanks (ed.) Routledge, 2007)
  • Linguistic Search Engine
    • Adam Kilgarriff (2003)
    • In Proceedings of Workshop on Shallow Processing of Large Corpora, SProLaC03, the United Kingdom, pp. 53–58.

If you have published a paper related to Sketch Engine, please send us the details and, if possible, a link to the document (email: support@sketchengine.eu)

Theses related to Sketch Engine

Romani Emma. Building A Multilingual Outlier Detection Dataset For The Evaluation Of Distributional Thesauri And Word Embeddings. Master's thesis, The University of Pavia, 2022. Supervised by Elisabetta Jezek
@mastersthesis{romani2022building,
  title={Building A Multilingual Outlier Detection Dataset For The Evaluation Of Distributional Thesauri And Word Embeddings},
  author={Romani, Emma},
  school={The University of Pavia},
  year={2022}
}

Suchomel Vít. Better Web Corpora For Corpus Linguistics And NLP. PhD thesis, Masaryk University, Faculty of Informatics, 2021.
@phdthesis{suchomel2021better,
  title={Better Web Corpora For Corpus Linguistics And NLP},
  author={Suchomel, Vít},
  school={Masaryk University, Faculty of Informatics},
  year={2021}
}

Abstract: The internet is used by computational linguists, lexicographers and social scientists as an immensely large source of text data for various NLP tasks and language studies. Web corpora can be built in sizes which would be virtually impossible to achieve using traditional corpus creation methods. This thesis presents a web crawler designed to obtain texts from the internet allowing to build large text corpora for NLP and linguistic applications. An asynchronous communication design (rather than usual synchronous multi-threaded design) was implemented for the crawler to provide an easy to maintain alternative to other web spider software. Cleaning techniques were devised to transform the messy nature of data coming from the uncontrolled environment of the internet. However, it can be observed that usability of recently built web corpora is hindered by several factors: The results derived from statistical processing of corpus data are significantly affected by the presence of non-text (web spam, computer generated text and machine translation) in text corpora. It is important to study the issue to be able to avoid non-text at all or at least decrease its size in web corpora. Another observed factor is the case of web pages or their parts written in multiple languages. Multilingual pages should be recognised, languages identified and text parts separated to respective monolingual corpora. This thesis proposes additional cleaning stages in the process of building text corpora which help to deal with these issues. Unlike traditional corpora made from printed media in the past decades, sources of web corpora are not categorised and described well, thus making it difficult to control the content of the corpus. Rich annotation of corpus content is dealt with in the last part of the thesis. An inter-annotator agreement driven English genre annotation and two experiments with supervised classification of text types in English and Estonian web corpora are presented.

Kletečka Jiří. Wikipedia Learner's Corpus. Master's thesis, Masaryk University, Faculty of Informatics, 2017. (in Czech)
@mastersthesis{kletecka2017wikipedia,
  title={Wikipedia Learner's Corpus},
  author={Kletečka, Jiří},
  school={Masaryk University, Faculty of Informatics},
  year={2017}
}

Abstract: This bachelor’s thesis deals with the automated creation of an error-annotated corpus from the revision history of Wikipedia articles. Such a corpus contains the newest versions of articles with marked errors obtained from their editing history. A new tool was designed and implemented for this purpose. It was then used to create a corpus from a Czech Wikipedia database dump, and this corpus was uploaded to the faculty server for public use through the Sketch Engine interface.

Cukr Michal. Czech corpus of example sentences. Master's thesis, Masaryk University Faculty of Arts, 2017. (in Czech)
@mastersthesis{cukr2017czech,
  title={Czech corpus of example sentences},
  author={Cukr, Michal},
  school={Masaryk University Faculty of Arts},
  year={2017}
}

Abstract: The purpose of this work was creating a Czech text corpus of sentence examples for a special language-learning interface SkELL. As source texts, we downloaded websites chosen for selective harvests by Czech Webarchiv and Czech Wikipedia including discussion. The third source is a part of JSI Newsfeed Corpus. Crawled texts were prepared by tools for corpus processing and the final text collection was deduplicated. Afterwards, we performed multiple cleaning. In the thesis, there are some examples from the created corpus. This corpus of Czech sentence examples is placed in the university installation of Sketch Engine (https://ske.fi.muni.cz/). The public access to the corpus is via SkELL interface available at https://skell.sketchengine.eu/#home?lang=cs.

Rábara Radoslav. Parallelization of the corpus manager's time-consuming operations. Master's thesis, Masaryk University, Faculty of Informatics, 2016. (in Czech)
@mastersthesis{rabara2016parallelization,
  title={Parallelization of the corpus manager's time-consuming operations},
  author={Rábara, Radoslav},
  school={Masaryk University, Faculty of Informatics},
  year={2016}
}

Abstract: The Manatee corpus manager can process large corpora containing billions of words. Some operations with search results from such large corpora can be time-consuming. This thesis provides and describes a system that enables computation of the selected operations in parallel. The system is evaluated on a single computer and on a cluster of computers. The evaluation covers scalability and comparisons with the Manatee system and with a MapReduce system that provides a platform for distributed computing.

Lucia Kocincová (2015). Interactive visualization methods for Sketch Engine. Master's thesis. Masaryk University, Faculty of Informatics.

Abstract: Visualization is undoubtedly one of the most desired methods for displaying data, especially when dealing with so called big data. Visualization can uncover unnoticed and hidden relationships within the data and in addition, it enables the users to understand and interpret the data with less effort. This thesis focuses on interactive visualizations generated from the corpora data. First, it introduces the state-of-the-art tools for corpora visualizations and a corpus management system named Sketch Engine, for which numerous design concepts were created. Then four of them – corpora overview, thesaurus, word sketch and word sketch difference – were implemented as an online application with the main use of the Data-Driven Documents library. Last, these visualizations were evaluated by the user testing which revealed that the implemented concepts were not only graphically very appealing but also helpful. Therefore, the interactive visualizations will be incorporated in the Sketch Engine online interface in the upcoming future.
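
For citing this thesis alongside the entries above, a BibTeX record might look like the following sketch. The fields are taken from the citation line; the citation key is an assumption chosen for illustration.

```bibtex
@mastersthesis{kocincova2015interactive,
  title={Interactive visualization methods for Sketch Engine},
  author={Kocincová, Lucia},
  school={Masaryk University, Faculty of Informatics},
  year={2015}
}
```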

Matouš Ejem (2015). English learner corpora [in Czech]. Bachelor's thesis. Masaryk University, Faculty of Arts.

Abstract: Learner corpora conjoin second language acquisition research, foreign language teaching and corpus linguistics. In this work I present available English learner corpora.

Lucie Kaplanová (2015). Collection of linguistically motivated examples of CQL [in Czech]. Bachelor's thesis. Masaryk University, Faculty of Arts.

Abstract: This bachelor's thesis deals with CQL (Corpus Query Language), a query language for corpora. It explains the use of individual operators, attributes, and structures that can be used in CQL searches. The thesis also includes a set of linguistically oriented CQL queries for Czech and English.

Monika Močiariková (2015). Methods for Automatic Acquisition of Dictionary Definitions [in Slovak]. Bachelor's thesis. Masaryk University, Faculty of Arts.

Abstract: The thesis explains the term “definition” and why it is difficult to decide whether a given sentence is a definition. It also describes the Sketch Engine system and the CQL language. The practical part is dedicated to the design, implementation and evaluation of queries for automatic definition search.

Dominika Talianová (2014). Corpus Data Visualization. Bachelor's thesis. Masaryk University, Faculty of Informatics.

Abstract: This thesis focuses on corpus data represented in graphical form. More specifically, it consists of a survey of visualization tools and a website created to hold visualizations based on two features of Sketch Engine, namely Word Sketch and Sketch-diff. These visualizations represent collocations and their salience in connection to different lemmas. The data essential for these visualizations are processed with the use of JavaScript and its D3 library in a JSON format and are provided by Natural Language Processing Centre at Masaryk University, Faculty of Informatics in Brno.

Radoslav Rábara (2014). Concurrent programming in searching text corpora [in Slovak]. Bachelor's thesis. Masaryk University, Faculty of Informatics.

Abstract: The aim of this thesis is to study approaches used in concurrent processing and to apply them to the evaluation of queries in the Manatee system. The work includes a detailed evaluation of query processing speed with various numbers of cores available, as well as a comparison of code length between the old and the new implementation.

Ondřej Herman (2013). Automatic methods for detection of word usage in time. Bachelor's thesis. Masaryk University, Faculty of Informatics.

Abstract: From a natural language corpus, word usage data over time can be extracted. To detect and quantify change in this data, automatic procedures can be employed. In this work, the theory of ordinary and robust regression methods is discussed and applied to real world data with great success. A Python implementation is included. Smoothing of time series and detection of seasonality is examined, but ultimately this path does not seem to give satisfactory results for the data explored.

Miloš Husák (2008). Automatic Retrieval of Good Dictionary Examples. Bachelor's thesis. Masaryk University, Faculty of Informatics.

Abstract: This thesis proposes and implements an algorithm for evaluation of sentences with respect to their understandability and informativeness. It can be embedded into a variety of applications, such as corpus querying tools or automated dictionaries. The proposed algorithm is highly customizable, since it employs a variety of criteria approximating the similarity of sentences to good dictionary examples. It was optimized using machine learning algorithms according to a set of manually labelled concordances. The algorithm is usable in practical applications; however, it is still being developed.
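
A BibTeX record for this bachelor's thesis could be sketched as follows. Standard BibTeX has no dedicated entry type for bachelor's theses, so this sketch assumes the optional type field of @mastersthesis is used to override the printed label; the citation key is likewise an assumption.

```bibtex
@mastersthesis{husak2008automatic,
  title={Automatic Retrieval of Good Dictionary Examples},
  author={Husák, Miloš},
  type={Bachelor's thesis},
  school={Masaryk University, Faculty of Informatics},
  year={2008}
}
```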