COVID-19 corpus from Open Research Dataset (CORD-19)

What is the COVID-19 corpus?

The Covid-19 corpus is an English corpus that consists of articles that were released as part of the COVID-19 Open Research Dataset (CORD-19). The final version of the CORD-19 dataset was released on June 2, 2022. The COVID-19 corpus consists of more than 370,000 papers about coronavirus and related topics.

The data were taken from the Github repository of CORD-19 https://github.com/allenai/cord19/blob/master/README.md

The corpus has rich metadata containing information about articles such as author, DOI, journal, publish time, title, etc. A full list of available metadata can be found in the Github repository. Since the original data contain many duplications, the corpus was deduplicated on the doc.doi attribute.

How to access the corpus?

The corpus is in the “open” category: no account is required to get access, just visit

http://ske.li/covid_19

Please note: some functionalities (e.g. building user subcorpora or extracting terms and keywords against a large reference corpus) require having an account. Please create a trial account and email your username to inquiries@sketchengine.eu with the subject “Covid-19 corpus” and we will give you a free 1-year account to access this corpus.

Tools to work with COVID-19 corpus

A complete set of tools is available for working with the COVID-19 to generating:

word sketch – collocations categorized by grammatical relations
thesaurus – synonyms and similar words for every word
keywords – terminology extraction of one-word and multi-word units
word lists – lists of nouns, verbs, adjectives etc. organized by frequency
n-grams – frequency lists of multi-word units
concordance – examples in context
text type analysis – statistics of metadata in the corpus

Downloads

If you are interested in the full corpus data, please contact us at support@sketchengine.eu. The Covid-19 corpus is tokenized, part-of-speech tagged and lemmatized.

Please do note that the PoS tagging and lemmatization have been done using TreeTagger and thus the annotation is available only for non-commercial purposes. The original license applies to source texts only.

Changelog

See original data changelog. Information below relates to the Sketch Engine processing.

version 2022-07-23 (published August, 2022 in Sketch Engine)

data update

version 2020-03-27 (published March 30th, 2020 in Sketch Engine)

data update

version 2020-03-27 (published March 30th, 2020 in Sketch Engine)

data update + all metadata now preserved for articles

version 2020-03-20 (published March 26th, 2020 in Sketch Engine)

initial version, mark-up of abstracts, documents, back matter and citations

Bibliographic references

How to reference Sketch Engine

Lucy Lu Wang, Kyle Lo, Yoganand Chandrasekhar, Russell Reas, Jiangjiang Yang, Doug Burdick, Darrin Eide, Kathryn Funk, Yannis Katsis, Rodney Michael Kinney, Yunyao Li, Ziyang Liu, William Merrill, Paul Mooney, Dewey A. Murdick, Devvret Rishi, Jerry Sheehan, Zhihong Shen, Brandon Stilson, et al.. 2020. CORD-19: The COVID-19 Open Research Dataset. In Proceedings of the 1st Workshop on NLP for COVID-19 at ACL 2020, Online. Association for Computational Linguistics.

COVID-19 Open Research Dataset (CORD-19). 2020. Version 2022-06-02. Retrieved from https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/historical_releases.html. Accessed 2022-06-02.

Search the COVID-19 corpus from Open Research Dataset

Sketch Engine offers a range of tools to work with the COVID-19 corpus.

open corpus

Covid-19 corpus - concordance

Other English corpora

Explore our largest Timestamped English corpus with 80+ billion words.

available English corpora

Use Sketch Engine in minutes

Generate collocations, frequency lists, examples in contexts, n-grams or extract. Use our Quick Start Guide to learn it in minutes.

Quick Start Guide

What is the COVID-19 corpus?

How to access the corpus?

Tools to work with COVID-19 corpus

version 2022-07-23 (published August, 2022 in Sketch Engine)

version 2020-03-27 (published March 30th, 2020 in Sketch Engine)

version 2020-03-27 (published March 30th, 2020 in Sketch Engine)

version 2020-03-20 (published March 26th, 2020 in Sketch Engine)

Search the COVID-19 corpus from Open Research Dataset

Other English corpora

Use Sketch Engine in minutes

for learners of languages

A Course in Lexicography and Lexical Computing

term extraction

learn sketch engine