The Toxicity Corpus was created from data released by the Civil Comments platform, which shut down in 2017 and published approximately 2 million public comments as an open archive. In an annotation effort sponsored by Jigsaw, human raters labelled the comments for toxic attributes.
The dataset includes comments, each accompanied by a toxicity label (target) representing the fraction of raters who considered the comment to be toxic.
The corpus includes several attributes that can be used in search queries. For instance, the attribute Toxicity (in %) distinguishes neutral from very toxic comments: the higher the percentage, the more toxic the comment. Other useful attributes include Sexual explicit (in %), Threat (in %), and Sad (the number of "sad" reactions given by users). The full list of attributes can be found in Corpus info and in Text Type Analysis.
An example CQL query to find toxic comments with a toxicity score above 90% (assuming the attribute values are stored as whole-number percentages, so the range 91–100 is matched with a regular expression) would be:
<s/> within <doc toxicity="9[1-9]|100" />
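Attribute conditions can also be combined in one structure tag. As a further sketch under the same assumption of whole-number percentage values, a query for comments that are both highly toxic (above 90%) and at least moderately threatening (50% or more) might look like:

```
<s/> within <doc toxicity="9[1-9]|100" threat="[5-9][0-9]|100" />
```

The attribute names toxicity and threat correspond to the Toxicity (in %) and Threat (in %) attributes listed above; check Corpus info for the exact attribute names and value formats in your installation.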
The Toxicity Corpus serves as a valuable resource for studying online conversation civility and toxicity, supporting linguistic analysis and model development. Researchers can access this corpus within Sketch Engine for in-depth analysis and research purposes.
More information about the dataset can be found here: https://www.kaggle.com/competitions/jigsaw-unintended-bias-in-toxicity-classification/data
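For anyone working with the raw Kaggle release directly rather than through Sketch Engine, the label semantics described above (target as the fraction of raters who considered a comment toxic) can be sketched in a few lines of Python. The rows below are invented stand-ins for train.csv; the field names (comment_text, target, threat) follow the Kaggle competition's data description:

```python
# Toy stand-in for rows of the Kaggle train.csv; field names follow the
# competition's data description ("target" = fraction of raters who
# marked the comment toxic; subtype scores share the same 0-1 scale).
rows = [
    {"comment_text": "thanks for sharing", "target": 0.00, "threat": 0.00},
    {"comment_text": "you are awful",      "target": 0.95, "threat": 0.10},
    {"comment_text": "I will find you",    "target": 0.92, "threat": 0.88},
]

# Comments that more than 90% of raters considered toxic
toxic = [r["comment_text"] for r in rows if r["target"] > 0.9]
print(toxic)  # ['you are awful', 'I will find you']
```

This mirrors the CQL query above: filtering on target > 0.9 in the raw data corresponds to querying the Toxicity attribute for values above 90%.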
Search the Toxicity corpus
Sketch Engine offers a range of tools for working with the Toxicity corpus.
Further information about texts in the Toxicity corpus
Tools to work with the Toxicity corpus from the web
A complete set of Sketch Engine tools is available for the Toxicity corpus to generate:
- word sketch – English collocations categorized by grammatical relations
- thesaurus – synonyms and similar words for every word
- keywords – terminology extraction of one-word and multiword units
- word lists – lists of English nouns, verbs, adjectives etc. organized by frequency
- n-grams – frequency list of multi-word units
- concordance – examples in context
- text type analysis – statistics of metadata in the corpus
Toxicity Corpus (toxicity_en)
version toxicity_en (September 2023)
- 102 million words
cjadams, Daniel Borkan, inversion, Jeffrey Sorensen, Lucas Dixon, Lucy Vasserman, nithum. (2019). Jigsaw Unintended Bias in Toxicity Classification. Kaggle. https://kaggle.com/competitions/jigsaw-unintended-bias-in-toxicity-classification