Mueller Report corpus | Sketch Engine

Mueller report as English corpus

The Mueller report corpus is an English corpus made up of the entire Mueller Report. The corpus was prepared by a team led by Damir Cavar (NLP Lab at Indiana University). They converted the original text from PDF to raw Unicode-encoded text, marked footnotes, and removed words or characters that were surrounded by square brackets (e.g. [T]he…).
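The bracket clean-up step can be illustrated with a minimal Python sketch. It is not the team's actual script: the function name, the keep_inner parameter, and the sample sentence are all hypothetical. The description above does not say whether only the brackets or also the text inside them was dropped, so the sketch covers both readings.

```python
import re

# Matches one square-bracketed span such as "[T]" or "[also]".
BRACKETED = re.compile(r"\[([^\[\]]*)\]")

def strip_brackets(text: str, keep_inner: bool = False) -> str:
    """Remove square-bracketed spans; optionally keep the text inside them."""
    # Assumption: which of the two behaviours the corpus authors used is not
    # stated in the source, so the caller chooses.
    replacement = r"\1" if keep_inner else ""
    return BRACKETED.sub(replacement, text)

# Made-up sample sentence, for illustration only.
sample = "[T]he Office [also] reviewed the records."
print(strip_brackets(sample))                   # drops the bracketed material
print(strip_brackets(sample, keep_inner=True))  # drops only the brackets
```

Comparing the output of either variant with the corpus text in the GitHub repository would show which reading applies.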

The Mueller report, officially titled Report On The Investigation Into Russian Interference In The 2016 Presidential Election, is the official report documenting the findings and conclusions of former Special Counsel Robert Mueller’s investigation into Russian efforts to interfere in the 2016 United States presidential election and allegations of conspiracy or coordination between Donald Trump’s presidential campaign and Russia. (Wikipedia)

The original data of the Mueller report corpus is available in the GitHub repository at https://github.com/SemiringInc/Mueller-Report-Corpus.

Part-of-speech tagset

This English corpus was tagged by TreeTagger using the Penn Treebank tagset with Sketch Engine modifications.

Sketch Engine modifications to the original Mueller report corpus

part-of-speech tagging
the tag that marks the beginning of each footnote is processed as a standard token, e.g. FN237, where the number is the footnote's position in the numerical order of footnotes
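Because footnote markers are ordinary tokens, they can be picked out of a token stream with a simple pattern. The Python sketch below is only an illustration: the marker shape (FN followed by digits) is inferred from the single example FN237, and the function name and token sequence are hypothetical.

```python
import re

# Assumed marker shape, inferred from the single example "FN237".
FOOTNOTE_TOKEN = re.compile(r"FN(\d+)")

def footnote_numbers(tokens):
    """Return the footnote numbers found among the given tokens, in order."""
    numbers = []
    for token in tokens:
        match = FOOTNOTE_TOKEN.fullmatch(token)
        if match:
            numbers.append(int(match.group(1)))
    return numbers

# Made-up token sequence, for illustration only.
tokens = "the report FN237 cites the interview".split()
print(footnote_numbers(tokens))
```

A similar pattern could be used to exclude footnote markers from frequency counts, if they are not wanted there.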

Tools to work with the Mueller report corpus

A complete set of tools is available for working with the Mueller report corpus. The tools generate:

word sketch – English collocations categorized by grammatical relations
thesaurus – synonyms and similar words for every word
keywords – terminology extraction of one-word and multi-word units
word lists – lists of English nouns, verbs, adjectives, etc. organized by frequency
n-grams – frequency list of multi-word units
concordance – examples in context
text type analysis – statistics of metadata in the corpus
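To make the n-gram entry in the list above concrete, here is a minimal Python sketch of an n-gram frequency list. It is a generic illustration, not Sketch Engine's implementation: the function name and the sample tokens are hypothetical, and whitespace splitting stands in for real tokenization.

```python
from collections import Counter

def ngram_frequencies(tokens, n):
    """Count every contiguous run of n tokens."""
    # zip over n staggered views of the token list yields each n-gram once.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams)

# Made-up sample text, split on whitespace for illustration only.
tokens = "the special counsel met the special counsel".split()
bigrams = ngram_frequencies(tokens, 2)

# Print the bigrams from most to least frequent.
for gram, count in bigrams.most_common():
    print(" ".join(gram), count)
```

Sorting by count, as most_common does, gives the same kind of frequency-ordered list that the n-grams tool presents for multi-word units.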

References & citation

Corpus text – entire Mueller report

The United States Department of Justice. Report on the Investigation into Russian Interference in the 2016 Presidential Election. Retrieved from the Department of Justice website: https://www.justice.gov/archives/sco/file/1373816/download

Corpus annotation (structures, conversion from PDF to raw text)

Damir Cavar <dcavar@me.com>, Semiring Inc. The NLP-Lab (Damir Cavar’s Research Lab at Indiana University in Bloomington) https://github.com/SemiringInc/Mueller-Report-Corpus