DOAJ corpora – Open Access Journals corpora

The Open Access Journals (OAJ) corpora are text corpora composed of journals covering all areas of science, technology, medicine, social science, and the humanities in dozens of languages.

The OAJ corpora contain rich metadata about journals, such as title, country, and year of publication. It is also possible to search by article keywords.

Detailed information about Open Access Journals can be found on the original website, the Directory of Open Access Journals.

A list of OAJ corpora in Sketch Engine

  • Open Access Journals (English) – 2.6 billion words

More languages will be available soon.

Part-of-speech tagset

Each OAJ corpus is part-of-speech (POS) tagged with a tagset that depends on its language.

Tools to work with the Open Access Journals corpus

A complete set of tools is available to work with this OAJ corpus to generate:

  • word sketch – English collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • word lists – lists of English nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
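
To make two of the outputs above concrete, here is a minimal Python sketch of an n-gram frequency list and a concordance (keyword in context). It is only an illustration under simple assumptions: the whitespace tokenizer, the sample sentence, and the function names `tokenize`, `ngram_frequencies` and `concordance` are hypothetical and are not part of Sketch Engine, which uses language-specific tokenization and its own tools.

```python
from collections import Counter


def tokenize(text):
    # Assumption: naive lowercase whitespace tokenizer, for illustration only.
    return text.lower().split()


def ngram_frequencies(tokens, n):
    # Count every run of n consecutive tokens (a multi-word unit).
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


def concordance(tokens, keyword, window=3):
    # Collect (left context, keyword, right context) for each occurrence.
    hits = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, token, right))
    return hits


# Hypothetical sample text, not taken from the corpus.
sample = "open access journals publish open access articles"
tokens = tokenize(sample)
bigrams = ngram_frequencies(tokens, 2)
lines = concordance(tokens, "journals", window=2)
```

Calling `bigrams.most_common()` lists the bigrams by descending frequency, which is the shape of an n-gram frequency list; `lines` holds one tuple per occurrence of the keyword with its surrounding context.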

DOAJ English corpus in detail

The chart shows the distribution of the parts of speech in the DOAJ English corpus.

Further information about texts in the corpus

Basic information

Tokens 3 350
Words 2 663
Sentences 123
Documents 0.66

* the figures above are given in millions (rounded)


Texts in DOAJ are published under a Creative Commons (CC) license.

More information about the licensing can be found at

Search the Open Access Journals corpus

Sketch Engine offers a range of tools to work with this English corpus.


Other English corpora

Explore our largest Timestamped English corpus with 30+ billion words.

Use Sketch Engine in minutes

Generate collocations, frequency lists, examples in context, and n-grams, or extract terms. Use our Quick Start Guide to learn it in minutes.