DOAJ corpora – Directory of Open Access Journals

The Directory of Open Access Journals (DOAJ) corpora are text corpora comprised of journals covering all areas of science, technology, medicine, social science, and humanities in dozens of languages.

The DOAJ corpora contain rich metadata about journals, such as title, country, year of publication, etc. It is also possible to search by the keywords of articles.

Detailed information about Open Access Journals can be found on the original website of the Directory of Open Access Journals.

A list of DOAJ corpora in Sketch Engine

  • Directory of Open Access Journals (DOAJ) – English – 2.6 billion words

Part-of-speech tagset

DOAJ corpora are POS tagged depending on language specifications.

Tools to work with the Open Access Journals corpus

A complete set of tools is available to work with the DOAJ corpus to generate:

  • word sketch – English collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • word lists – lists of English nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
  • keywords – terminology extraction of one-word units
  • text type analysis – statistics of metadata in the corpus

DOAJ English corpus in detail

The chart shows the distribution of the parts of speech in the DOAJ English corpus.

Basic information

Tokens 3 350
Words 2 663
Sentences 123
Documents 0.66

* the figures above are in millions (rounded)
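
As a rough illustration of how the figures above relate to one another, the following minimal sketch derives a few averages from them. The variable names are chosen for this example only, and because the source figures are rounded to millions, the derived values are approximations rather than exact corpus statistics.

```python
# Figures from the table above, stated in millions (rounded in the source).
MILLION = 1_000_000
tokens = 3_350 * MILLION
words = 2_663 * MILLION
sentences = 123 * MILLION
documents = 0.66 * MILLION

# Derived averages; approximate, since the inputs are rounded.
tokens_per_sentence = tokens / sentences
words_per_document = words / documents
word_token_ratio = words / tokens  # share of tokens that are counted as words

print(f"tokens per sentence: {tokens_per_sentence:.1f}")
print(f"words per document:  {words_per_document:.0f}")
print(f"words / tokens:      {word_token_ratio:.2f}")
```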

Metadata (Structures and attributes)

Metadata | Description | Example
Authors | author's name | Wei Wang
Country of journal | country of issue | US,
Document id | document identification | 9999884fafc844958864f26e06a22373
Identifier | print ISSN and electronic ISSN | pissn:2078-0958;eissn:2078-0966
Journal languages | language of the journal | EN, English
Journal number | number of the journal | 1
Journal publisher | publisher of the journal | Copernicus Publications
Journal title | title of the journal | Mathematical Problems in Engineering
Journal volume | volume of the journal | 7
Keywords | keywords of the journal | climate change
Last updated | last modification | 2016-09-30T18:33:16Z
Month | month of publication | 12
Subjects | subjects of the document | Health Sciences
Time stamp | type of document | 2004-05-31T00:00:00Z
Title | name of the article | Sovereignty in Conflict
Url | web address |
Year of publication | year of publication | 2014
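
For readers who process exported metadata outside Sketch Engine, the sketch below shows one way to split the Identifier value and parse the timestamps. It assumes the values always follow the shape of the examples in the table above (`key:value` pairs separated by `;`, and UTC timestamps ending in `Z`); the function names `parse_identifier` and `parse_timestamp` are hypothetical helpers, not part of any Sketch Engine API.

```python
from datetime import datetime, timezone


def parse_identifier(value: str) -> dict[str, str]:
    """Split a value such as 'pissn:2078-0958;eissn:2078-0966' into a dict.

    Assumption: pairs are separated by ';' and each pair is 'key:value'.
    """
    result: dict[str, str] = {}
    for part in value.split(";"):
        part = part.strip()
        if not part:
            continue
        key, _, issn = part.partition(":")
        result[key.strip()] = issn.strip()
    return result


def parse_timestamp(value: str) -> datetime:
    """Parse a timestamp such as '2016-09-30T18:33:16Z' as a UTC datetime.

    Assumption: the trailing 'Z' is always present and means UTC.
    """
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


if __name__ == "__main__":
    print(parse_identifier("pissn:2078-0958;eissn:2078-0966"))
    print(parse_timestamp("2016-09-30T18:33:16Z").isoformat())
```

Keeping the identifier as a dictionary rather than two fixed fields means a record that carries only a print ISSN or only an electronic ISSN is still handled without special cases.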

Texts in DOAJ are published under a Creative Commons (CC) license.

More information about the licensing can be found at

Search the corpora of the Directory of Open Access Journals

Sketch Engine offers a range of tools to work with these DOAJ corpora.
