DOAJ corpora – Open Access Journals corpora

The Open Access Journals (OAJ) corpora are text corpora comprised of journals covering all areas of science, technology, medicine, social science, and humanities in dozens of languages.

The OAJ corpora contain rich metadata about journals, such as title, country, year of publication, etc. It is also possible to search by the keywords of articles.

Detailed information about Open Access Journals can be found on the original website, the Directory of Open Access Journals (DOAJ).

A list of OAJ corpora in Sketch Engine

  • Open Access Journals (English) – 2.6 billion words

More languages will be available soon.

Part-of-speech tagset

OAJ corpora are part-of-speech (POS) tagged with a tagset that depends on the language of each corpus.

Tools to work with the Open Access Journals corpus

A complete set of tools is available to work with this OAJ corpus to generate:

  • word sketch – English collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • word lists – lists of English nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
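To make the n-gram item above concrete, here is a minimal sketch of how a frequency list of multi-word units can be built from a list of tokens. It is an illustration of the general idea only, not Sketch Engine's implementation; the function name `ngram_frequencies` and the toy sentence are assumptions made for this example.

```python
from collections import Counter


def ngram_frequencies(tokens, n):
    """Count every contiguous run of n tokens (hypothetical helper)."""
    # zip over n shifted copies of the token list yields each n-gram once
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


# Toy sentence invented for illustration; it is not taken from the corpus.
tokens = "open access journals make open access research easier to find".split()

bigrams = ngram_frequencies(tokens, 2)
for gram, count in bigrams.most_common(3):
    print(count, gram)
```

A real corpus tool would also handle tokenization, case, and punctuation, which this sketch deliberately leaves out.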

DOAJ English corpus in detail

The chart shows the distribution of the parts of speech in the DOAJ English corpus.

Basic information

Unit | Frequency*
Tokens | 3,350
Words | 2,663
Sentences | 123
Documents | 0.66

* The figures above are in millions, rounded.
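The figures above can be combined into a few rough averages, such as words per document or tokens per sentence. The short calculation below is a sketch only: the inputs are the rounded values published above, so the results are approximate, and the variable names are chosen for this example.

```python
# Figures from the basic information above, in millions (rounded).
TOKENS_M = 3350
WORDS_M = 2663
SENTENCES_M = 123
DOCUMENTS_M = 0.66

# Both operands are in millions, so the unit cancels out in each ratio.
words_per_document = WORDS_M / DOCUMENTS_M
tokens_per_sentence = TOKENS_M / SENTENCES_M
word_share_of_tokens = WORDS_M / TOKENS_M

print(f"words per document:   {words_per_document:.0f}")
print(f"tokens per sentence:  {tokens_per_sentence:.1f}")
print(f"share of word tokens: {word_share_of_tokens:.0%}")
```

The difference between tokens and words is due to tokens that are not words, such as punctuation.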

Metadata (Structures and attributes)

Metadata | Description | Example
Authors | author's name | Wei Wang
Country of journal | country of issue | US
Document id | document identification | 9999884fafc844958864f26e06a22373
Identifier | print ISSN and electronic ISSN | pissn:2078-0958;eissn:2078-0966
Journal languages | language of the journal | EN, English
Journal number | number of the journal | 1
Journal publisher | publisher of the journal | Copernicus Publications
Journal title | title of the journal | Mathematical Problems in Engineering
Journal volume | volume of the journal | 7
Keywords | keywords of the journal | climate change
Last updated | last modification | 2016-09-30T18:33:16Z
Month | month of publication | 12
Subjects | subjects of the document | Health Sciences
Time stamp | time stamp of the document | 2004-05-31T00:00:00Z
Title | title of the article | Sovereignty in Conflict
Url | web address | http://www.ijpsonline.com/article.asp?issn=0250-474X
Wordcount | number of words in the document | 1081
Year of publication | year of publication | 2014
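Some of the example values above pack several pieces of information into one string. The sketch below shows one way such values could be split up after export. It assumes that every Identifier value follows the `pissn:…;eissn:…` pattern of the example and that time stamps always use the `YYYY-MM-DDTHH:MM:SSZ` form; neither is guaranteed for the whole corpus, and the helper names `parse_identifier` and `parse_timestamp` are hypothetical.

```python
from datetime import datetime, timezone


def parse_identifier(value):
    """Split 'pissn:...;eissn:...' into a dict keyed by ISSN type."""
    result = {}
    for part in value.split(";"):
        if ":" in part:
            key, _, issn = part.partition(":")
            result[key.strip()] = issn.strip()
    return result


def parse_timestamp(value):
    """Parse 'YYYY-MM-DDTHH:MM:SSZ' as a timezone-aware UTC datetime."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    # The trailing 'Z' means UTC, so attach that zone explicitly.
    return parsed.replace(tzinfo=timezone.utc)


# Example values copied from the metadata overview above.
issns = parse_identifier("pissn:2078-0958;eissn:2078-0966")
updated = parse_timestamp("2016-09-30T18:33:16Z")

print(issns["pissn"], issns["eissn"], updated.date())
```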

Texts in DOAJ are published under Creative Commons (CC) licenses.

More information about the licensing can be found at https://doaj.org/publishers#licensing

Search the Open Access Journals corpus

Sketch Engine offers a range of tools to work with this English corpus.

Other English corpora

Explore our largest Timestamped English corpus with 35+ billion words.

Use Sketch Engine in minutes

Generate collocations, frequency lists, examples in context, and n-grams, or extract terms. Use our Quick Start Guide to learn how in minutes.