Toxicity Corpus

The Toxicity Corpus was created using data from the Civil Comments platform, which closed in 2017 and released approximately 2 million public comments as an open archive. The annotation was sponsored by Jigsaw (a research group within Google): human raters labelled the comments to identify their degree of toxicity.

The dataset includes comments, each accompanied by a toxicity label (target) representing the fraction of raters who considered the comment to be toxic.
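As a minimal sketch of how such a label can be read, the snippet below computes the fraction of raters who marked a comment as toxic. The function name `toxicity_fraction` and the example votes are hypothetical and serve only as an illustration; they are not part of the original dataset tooling.

```python
def toxicity_fraction(votes):
    """Return the fraction of raters who marked a comment as toxic.

    `votes` is a list of booleans, one per rater (True = toxic).
    """
    if not votes:
        raise ValueError("at least one rater vote is required")
    return sum(votes) / len(votes)


# Hypothetical example: 10 raters, 7 of whom marked the comment as toxic.
votes = [True] * 7 + [False] * 3
label = toxicity_fraction(votes)
print(f"target = {label:.2f}, i.e. {label:.0%} of raters")
```

A comment that no rater considered toxic would get a label of 0, and one that every rater considered toxic would get a label of 1.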

The corpus includes several attributes that you can use in your search queries. For instance, you can use the attribute Toxicity (in %) to find neutral or very toxic comments: the higher the percentage, the more toxic the comment. Other useful attributes include Sexual explicit (in %), Threat (in %), and Sad (the number of such reactions given by users). More attributes can be found in Corpus info and Text Type Analysis.

An example CQL query that finds sentences in comments with a toxicity of 90% or higher would be:

<s/> within <doc toxicity>="90"/>
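For readers working with the data outside Sketch Engine, the same threshold can be sketched in Python. This is only an illustration under stated assumptions: the records are invented, the helper `select_toxic` is hypothetical, and the `>=` in the query is read as "greater than or equal to". It does not reproduce how Sketch Engine evaluates CQL.

```python
# Hypothetical records: (comment text, fraction of raters who judged it toxic).
comments = [
    ("Thanks for the thoughtful article.", 0.00),
    ("This is a borderline remark.", 0.45),
    ("An extremely hostile remark.", 0.93),
]


def select_toxic(records, min_percent=90):
    """Keep records whose toxicity, expressed in percent, is at least min_percent."""
    return [(text, frac) for text, frac in records if frac * 100 >= min_percent]


for text, frac in select_toxic(comments):
    print(f"{frac:.0%}\t{text}")
```

Lowering `min_percent` widens the selection, while a separate upper bound would be needed to isolate neutral comments.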

The Toxicity Corpus serves as a valuable resource for studying online conversation civility and toxicity, supporting linguistic analysis and model development. Researchers can access this corpus within Sketch Engine for in-depth analysis and research purposes.

Original data

The original dataset can be downloaded from Kaggle, a data science competition platform.

Search the Toxicity corpus

Sketch Engine offers a range of tools to work with this Toxicity corpus.

Corpus sizes

Tokens 118,695,956
Words 102,132,547
Sentences 6,787,252
Documents 1,999,515

Tools to work with the Toxicity corpus from the web

A complete set of Sketch Engine tools is available to work with this Toxicity corpus to generate:

  • word sketch – English collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • keywords – terminology extraction of one-word and multiword units
  • word lists – lists of English nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
  • text type analysis – statistics of metadata in the corpus

Toxicity Corpus (toxicity_en)

version toxicity_en (September 2023)

  • 102 million words

cjadams, Daniel Borkan, inversion, Jeffrey Sorensen, Lucas Dixon, Lucy Vasserman, nithum. (2019). Jigsaw Unintended Bias in Toxicity Classification. Kaggle.
