Mueller report as English corpus

The Mueller report corpus is an English corpus made up of the entire Mueller Report. The corpus was prepared by a team led by Damir Cavar (NLP Lab at Indiana University). They converted the original text from PDF to raw Unicode-encoded text, marked footnotes, and removed words or characters that were surrounded by square brackets (e.g. [T]he…).

The Mueller report, officially titled Report On The Investigation Into Russian Interference In The 2016 Presidential Election, is the official report documenting the findings and conclusions of former Special Counsel Robert Mueller’s investigation into Russian efforts to interfere in the 2016 United States presidential election and allegations of conspiracy or coordination between Donald Trump’s presidential campaign and Russia. (Wikipedia)

The original data of the Mueller report corpus is available in this GitHub repository.

Part-of-speech tagset

This English corpus was tagged by TreeTagger using the Penn Treebank tagset with Sketch Engine modifications.

Sketch Engine modifications to the original Mueller report corpus

  • part-of-speech tagging
  • the tag that marks the beginning of each footnote is processed as a standard token, e.g. FN237, where the number is the footnote's position in numerical order

Tools to work with the Mueller report corpus

A complete set of tools is available to work with this Mueller report corpus to generate:

  • word sketch – English collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • keywords – terminology extraction of one-word and multi-word units
  • word lists – lists of English nouns, verbs, adjectives, etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
  • text type analysis – statistics of metadata in the corpus
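To make two of the outputs listed above more concrete, here is a minimal Python sketch of what an n-gram frequency list and a concordance contain. It is not Sketch Engine's implementation: the tokenizer is deliberately naive, the function names (`tokenize`, `ngram_frequencies`, `concordance`) are hypothetical, and the sample sentence is made up for illustration rather than taken from the corpus.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens (naive tokenizer)."""
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text.lower())


def ngram_frequencies(tokens, n=2):
    """Count every run of n consecutive tokens."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


def concordance(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for each occurrence of keyword."""
    hits = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits


# Made-up sample text, used only to illustrate the two outputs.
sample = (
    "The Special Counsel investigated Russian interference. "
    "The Special Counsel issued a report."
)
tokens = tokenize(sample)
bigrams = ngram_frequencies(tokens, n=2)
kwic = concordance(tokens, "counsel", width=2)

for gram, count in bigrams.most_common(3):
    print(count, gram)
for left, keyword, right in kwic:
    print(f"{left} [{keyword}] {right}")
```

On a real corpus, the same idea would be applied to the tagged tokens rather than to a regex split, so that footnote markers such as FN237 and part-of-speech tags are available for filtering.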

Corpus text – entire Mueller report

The United States Department of Justice. Report on the Investigation into Russian Interference in the 2016 Presidential Election. Retrieved from the Department of Justice website: https://www.justice.gov/archives/sco/file/1373816/download

Corpus annotation (structures, conversion from PDF to raw text)

Damir Cavar <dcavar@me.com>, Semiring Inc. The NLP-Lab (Damir Cavar’s Research Lab at Indiana University in Bloomington) https://github.com/SemiringInc/Mueller-Report-Corpus

Search the Mueller report corpus

Sketch Engine offers a range of tools to work with the Mueller report corpus.

Other text corpora

Sketch Engine offers 800+ language corpora.

Use Sketch Engine in minutes

Generate collocations, frequency lists, examples in context, and n-grams, or extract terms. Use our Quick Start Guide to learn it in minutes.