What is the Covid-19 corpus?

This corpus consists of texts that were released as part of the COVID-19 Open Research Dataset (CORD-19). Reference: COVID-19 Open Research Dataset (CORD-19). 2020. Version 2020-03-27. Retrieved from https://pages.semanticscholar.org/coronavirus-research. Accessed 2020-03-28. doi:10.5281/zenodo.3715506

How to access the corpus?

The corpus is in the “open” category: no account is required to get access, just visit

http://ske.li/covid_19

Please note: some functions (e.g. building user subcorpora, or extracting terms and keywords against a large reference corpus) require an account. To use them, create a trial account and email your username to inquiries@sketchengine.eu with the subject “Covid 19 corpus”, and we will give you a free account to access this corpus. EU researchers will typically have free access through the ELEXIS infrastructure.

Tools to work with the Covid-19 corpus

A complete set of tools is available for working with the Covid-19 corpus. They generate:

  • word sketch – collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • keywords – terminology extraction of one-word and multi-word units
  • word lists – lists of nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
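To make two of the outputs listed above concrete, here is a minimal Python sketch of what an n-gram frequency list and a concordance compute. It is only an illustration on a made-up sentence: the function names `ngrams` and `concordance` and the sample text are assumptions introduced here, and the code does not reproduce how Sketch Engine implements these tools.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def concordance(tokens, keyword, width=3):
    """Return (left context, match, right context) for each hit of keyword."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


# Made-up sample text, not taken from the corpus.
SAMPLE = "the virus spreads quickly and the virus mutates slowly".split()

bigrams = ngrams(SAMPLE, 2)
kwic = concordance(SAMPLE, "virus", width=2)

for left, match, right in kwic:
    print(f"{left:>10} [{match}] {right}")
```

In the real corpus the same ideas are applied to the full tokenised text, and results can additionally be filtered by part of speech or lemma.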

You can download the source texts in vertical format; they are tokenised, part-of-speech tagged and lemmatised.

Please note that the PoS tagging and lemmatisation were done using TreeTagger, so the annotation is available for non-commercial purposes only. The original licence applies to the source texts only.
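If you download the vertical texts, a short script is usually enough to read them. The sketch below assumes the common vertical layout: one token per line, tab-separated columns in the order word, tag, lemma, and structure markup (such as documents and sentences) on separate lines in angle brackets. The column order, the structure names and the sample lines are assumptions for illustration; check them against the downloaded file before relying on this.

```python
from collections import Counter


def parse_vertical(lines):
    """Yield (word, tag, lemma) from vertical-format lines.

    Assumes three tab-separated columns per token line. Lines that start
    with '<' and end with '>' are treated as structure markup and skipped.
    """
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue
        if line.startswith("<") and line.endswith(">"):
            continue  # structure line, e.g. a document or sentence boundary
        cols = line.split("\t")
        if len(cols) < 3:
            continue  # not a token line in the assumed layout
        yield cols[0], cols[1], cols[2]


# Hypothetical sample; tags and structure names are illustrative only.
SAMPLE = [
    '<doc id="hypothetical-1">',
    "<s>",
    "Viruses\tNNS\tvirus",
    "spread\tVVP\tspread",
    ".\tSENT\t.",
    "</s>",
    "</doc>",
]

tokens = list(parse_vertical(SAMPLE))

# Example use: frequency list of noun lemmas (tags beginning with "N").
noun_lemmas = Counter(lemma for _, tag, lemma in tokens if tag.startswith("N"))
print(noun_lemmas.most_common(5))
```

For a real file, pass an open file handle instead of `SAMPLE`, for example `parse_vertical(open(path, encoding="utf-8"))`, where `path` is wherever you saved the download.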

See the original data changelog. The information below relates to the Sketch Engine processing.

version 2020-03-27 (published March 30th, 2020 in Sketch Engine)

  • data update + all metadata now preserved for articles

version 2020-03-20 (published March 26th, 2020 in Sketch Engine)

  • initial version, mark-up of abstracts, documents, back matter and citations

How to reference Sketch Engine

This corpus consists of texts that were released as part of the COVID-19 Open Research Dataset (CORD-19). Reference: COVID-19 Open Research Dataset (CORD-19). 2020. Version 2020-03-13. Retrieved from https://pages.semanticscholar.org/coronavirus-research. Accessed 2020-03-22. doi:10.5281/zenodo.3715506
