Toxicity Corpus

The Toxicity Corpus was built from data from the Civil Comments platform, which shut down in 2017 and released approximately 2 million public comments as an open archive. In an effort sponsored by Jigsaw (a research group within Google), human raters annotated the comments for their degree of toxicity.

Each comment in the dataset is accompanied by a toxicity label (target) representing the fraction of raters who considered the comment to be toxic.
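To make the label concrete, here is a minimal sketch of how such a fraction could be computed from individual rater votes. The function names and the ten votes are hypothetical, and the percentage conversion assumes that the corpus attributes given "in %" are simply this fraction multiplied by 100.

```python
from typing import Sequence


def toxicity_fraction(votes: Sequence[bool]) -> float:
    """Return the share of raters who marked a comment as toxic (0.0 to 1.0)."""
    if not votes:
        raise ValueError("at least one rater vote is required")
    return sum(votes) / len(votes)


def as_percent(fraction: float) -> float:
    """Convert a 0-1 fraction to a percentage, rounded to one decimal place."""
    return round(fraction * 100, 1)


# Hypothetical votes from ten raters: three of them considered the comment toxic.
votes = [True, False, False, True, False, False, False, True, False, False]
target = toxicity_fraction(votes)
print(target, as_percent(target))
```

A comment that no rater flagged gets a label of 0, and one that every rater flagged gets 1.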

The corpus includes several attributes that you can use in your search queries. For instance, you can use the attribute Toxicity (in %) to find neutral or very toxic comments: the higher the percentage, the more toxic the comment. Other useful attributes include Sexual explicit (in %), Threat (in %) and Sad (the number of reactions of that type given by users). More attributes are listed in Corpus info and Text Type Analysis.

An example CQL query to find sentences in comments with a toxicity of 90% or higher (the >= comparison includes 90 itself) would be:

<s/> within <doc toxicity>="90"/>
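If you need the same query for several thresholds, a small helper can generate the query strings. This is only a sketch: the function name is hypothetical, the query shape is copied from the example above, and toxicity is the only attribute identifier confirmed by this page. Identifiers for other attributes should be checked in Corpus info before use.

```python
def doc_threshold_query(attribute: str, min_percent: int) -> str:
    """Build a CQL query for sentences in documents at or above a threshold.

    The query shape follows the example on this page; the attribute name
    must match a document attribute that actually exists in the corpus.
    """
    if not attribute.isidentifier():
        raise ValueError(f"unexpected attribute name: {attribute!r}")
    if not 0 <= min_percent <= 100:
        raise ValueError("min_percent must be between 0 and 100")
    return f'<s/> within <doc {attribute}>="{min_percent}"/>'


# Generate queries for a few toxicity thresholds.
for threshold in (50, 75, 90):
    print(doc_threshold_query("toxicity", threshold))
```

The resulting strings can be pasted into the CQL field of the concordance search.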

The Toxicity Corpus is a valuable resource for studying civility and toxicity in online conversation, supporting both linguistic analysis and model development. Researchers can access the corpus within Sketch Engine for in-depth analysis.

Original data

The original dataset can be downloaded from Kaggle, a data science competition platform.
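For readers working with the original download rather than Sketch Engine, the following sketch filters comments by their label using only the Python standard library. It assumes the data is a CSV file with a comment_text column and a target column holding the 0-1 fraction. Verify these column names against the file you download. The three sample rows are placeholders, not real corpus data.

```python
import csv
import io
from typing import Iterator, TextIO, Tuple


def toxic_comments(
    handle: TextIO,
    threshold: float = 0.9,
    text_col: str = "comment_text",  # assumed column name
    label_col: str = "target",       # assumed column name
) -> Iterator[Tuple[float, str]]:
    """Yield (score, text) for rows whose label is at or above the threshold."""
    for row in csv.DictReader(handle):
        score = float(row[label_col])
        if score >= threshold:
            yield score, row[text_col]


# Placeholder rows standing in for the downloaded file.
sample = io.StringIO(
    "id,target,comment_text\n"
    "1,0.0,Thanks for the thoughtful article.\n"
    "2,0.95,<placeholder for a highly toxic comment>\n"
    "3,0.5,<placeholder for a borderline comment>\n"
)
result = list(toxic_comments(sample))
print(result)
```

With a real file, replace the in-memory sample with open(path, newline="", encoding="utf-8"), where path points to the downloaded CSV.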

Search the Toxicity corpus

Sketch Engine offers a range of tools to work with this Toxicity corpus.

Corpus sizes


Tokens	118,695,956
Words	102,132,547
Sentences	6,787,252
Documents	1,999,515

Tools to work with the Toxicity corpus from the web

The complete set of Sketch Engine tools is available for working with the Toxicity corpus:

word sketch – English collocations categorized by grammatical relations
thesaurus – synonyms and similar words for every word
keywords – terminology extraction of one-word and multiword units
word lists – lists of English nouns, verbs, adjectives etc. organized by frequency
n-grams – frequency list of multi-word units
concordance – examples in context
text type analysis – statistics of metadata in the corpus

Changelog

Toxicity Corpus (toxicity_en)

version toxicity_en (September 2023)

102 million words

Bibliography

cjadams, Daniel Borkan, inversion, Jeffrey Sorensen, Lucas Dixon, Lucy Vasserman, nithum. (2019). Jigsaw Unintended Bias in Toxicity Classification. Kaggle. https://kaggle.com/competitions/jigsaw-unintended-bias-in-toxicity-classification

Other text corpora

Sketch Engine offers 800+ language corpora.
