What is the Covid-19 corpus?
The Covid-19 corpus consists of texts released as part of the COVID-19 Open Research Dataset (CORD-19). Reference: COVID-19 Open Research Dataset (CORD-19). 2020. Version 2020-05-02. Retrieved from https://pages.semanticscholar.org/coronavirus-research. Accessed 2020-05-02. doi:10.5281/zenodo.3715505
How to access the corpus?
Please note: some functionality (e.g. building user subcorpora or extracting terms and keywords against a large reference corpus) requires an account. Please create a trial account and email your username to email@example.com with the subject “Covid-19 corpus”, and we will give you a free account to access this corpus.
Tools to work with the Covid-19 corpus
A complete set of tools is available to work with the Covid-19 corpus and generate:
- word sketch – collocations categorized by grammatical relations
- thesaurus – synonyms and similar words for every word
- keywords – terminology extraction of one-word and multi-word units
- word lists – lists of nouns, verbs, adjectives, etc. organized by frequency
- n-grams – frequency lists of multi-word units
- concordance – examples in context
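To make three of these outputs concrete (word lists, n-grams and concordance), here is a minimal Python sketch on a two-sentence toy text. It is not how Sketch Engine computes them: the tokenizer is deliberately naive, and the function names `tokenize`, `ngram_counts` and `concordance`, as well as the sample text, are made up for illustration.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (naive, illustrative only)."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def ngram_counts(tokens, n):
    """Count every contiguous run of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def concordance(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for each occurrence of keyword."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


# Hypothetical sample text, not taken from the corpus.
sample = ("The virus spreads through respiratory droplets. "
          "The virus binds to host cell receptors.")

tokens = tokenize(sample)
word_list = Counter(tokens)          # word list: frequency of each token
bigrams = ngram_counts(tokens, 2)    # n-grams with n = 2
lines = concordance(tokens, "virus") # keyword in context

print(word_list.most_common(3))
print(bigrams.most_common(3))
for left, kw, right in lines:
    print(f"{left:>30} [{kw}] {right}")
```

The sketch ignores sentence boundaries, part-of-speech tags and lemmas, all of which the annotated corpus provides and the real tools can use.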
If you are interested in the full corpus data, please contact us at firstname.lastname@example.org. The Covid-19 corpus is tokenized, part-of-speech tagged and lemmatized.
Please note that the PoS tagging and lemmatization were done using TreeTagger, so the annotation is available for non-commercial purposes only. The original license applies to source texts only.
See the original data changelog. The information below relates to the Sketch Engine processing.
version 2020-03-27 (published March 30th, 2020 in Sketch Engine)
- data update + all metadata now preserved for articles
version 2020-03-20 (published March 26th, 2020 in Sketch Engine)
- initial version, mark-up of abstracts, documents, back matter and citations
This corpus consists of texts that were released as part of the COVID-19 Open Research Dataset (CORD-19). Reference: COVID-19 Open Research Dataset (CORD-19). 2020. Version 2020-03-13. Retrieved from https://pages.semanticscholar.org/coronavirus-research. Accessed 2020-03-22. doi:10.5281/zenodo.3715506