Corpus of Elsevier Open Access Journals

The Elsevier OA CC-BY Corpus is an English corpus of 40,000 scientific research papers that form a representative sample from across scientific disciplines. It comprises open access articles published under the CC-BY 4.0 (Creative Commons) license in journals of Elsevier, a Dutch publishing company specializing in scientific, technical, and medical content. The articles were published between 2014 and 2020.

The original data of the Elsevier OA CC-BY Corpus were prepared by Daniel Kershaw and Rob Koeling. More information about the corpus can be found in the Digital Commons (Elsevier) deposit.

Part-of-speech tagset

The Elsevier Open Access Journals corpus is part-of-speech tagged with the TreeTagger part-of-speech tagset.

Basic information

Tokens 43,125,207,462
Words 36,561,273,153
Sentences 2,008,143,278
Web pages 78,373,887

Elsevier OA CC-BY Corpus – year distribution

The English corpus of Elsevier Open Access Journals contains 40,000 scientific articles from 2014 to 2020.


Search the Elsevier OA CC-BY Corpus

Sketch Engine offers a range of tools to work with this English corpus of Elsevier Journals.

Tools to work with the Elsevier OA CC-BY Corpus

A complete set of Sketch Engine tools is available to work with this English corpus of scientific papers to generate:

  • word sketch – English collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • keywords – terminology extraction of one-word and multi-word units
  • word lists – lists of English nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
  • text type analysis – statistics of metadata in the corpus
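To make two of the outputs listed above more concrete, the following is a minimal Python sketch of an n-gram frequency list and a keyword-in-context concordance. It is not Sketch Engine's implementation: the function names (`tokenize`, `ngram_frequencies`, `concordance`), the naive regular-expression tokenizer, and the sample sentence are all assumptions made for illustration, and a real corpus tool would rely on proper tokenization and indexing.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercased word tokens; a deliberately naive stand-in for a real tokenizer.
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def ngram_frequencies(tokens, n):
    # Count every run of n consecutive tokens (ignores sentence boundaries).
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


def concordance(tokens, keyword, width=3):
    # Return (left context, keyword, right context) for each occurrence.
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


# Hypothetical sample text, not taken from the corpus.
sample = (
    "Open access articles are free to read. "
    "Open access journals publish open access articles."
)
tokens = tokenize(sample)
bigrams = ngram_frequencies(tokens, 2)
kwic = concordance(tokens, "articles", width=2)

for phrase, count in bigrams.most_common(3):
    print(count, phrase)
for left, keyword, right in kwic:
    print(f"{left:>15} [{keyword}] {right}")
```

Sorting the counts by frequency gives a list comparable in spirit to the n-grams tool, while the three-part tuples mirror the left context, keyword, and right context layout of a concordance line.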

Kershaw, Daniel; Koeling, Rob (2020), “Elsevier OA CC-BY Corpus”, Mendeley Data, V1, doi: 10.17632/zm33cdndxs.1

