SiBol corpus of English broadsheets

SiBol: Corpus of English broadsheet newspapers 1993–2021

SiBol is a corpus of articles collected from English-language newspapers published between 1993 and 2021. It contains around 850 million words in 2 million articles from 18 newspapers. The initial version of the corpus, containing UK broadsheets, was created in 2011. It was extended in 2017 and 2021 to include UK tabloids and newspapers from other countries and regions, including India, the USA, Hong Kong, Nigeria and the Arab world. Searches can be restricted to a specific year, newspaper, author or date.

Part-of-speech tagset

The SiBol corpus was annotated by the TreeTagger tool using the Penn Treebank tagset with Sketch Engine modifications.

Authors of the SiBol corpus

about SiBol project

The SiBol corpus was compiled by a small team of linguistics researchers at the Universities of Siena and Bologna.

Content

The distribution of corpus texts by year of publication and by newspaper title is described below.

Articles by year of publication


The years 2011, 2012 and 2014 together account for only 1,919 articles, which is less than 0.1% of the whole corpus.

Articles by newspaper title

Newspaper	Number of articles
The Times	447,524
The Daily Telegraph	362,139
The Guardian	314,544
Times of India	255,642
Daily Mirror	131,674
The Sun	118,242
New York Times	94,000
Sunday Times	84,872
Daily Mail	62,131
The Express	34,491
The Sunday Telegraph	25,340
Metro	23,970
This Day Lagos	22,172
South China Morning Post	19,484
Gulf News	17,787
The Evening Standard	13,977
Washington Times	10,654
Daily News Egypt	8,486
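
The figures above can be cross-checked with a few lines of arithmetic. The sketch below copies the article counts from the table, sums them, and computes the share represented by the 1,919 articles from 2011, 2012 and 2014. It assumes that the "whole corpus" in that statement means the total number of articles in the table; the names ARTICLES and share are illustrative only and are not part of any Sketch Engine API.

```python
# Article counts per newspaper, copied from the table above.
ARTICLES = {
    "The Times": 447_524,
    "The Daily Telegraph": 362_139,
    "The Guardian": 314_544,
    "Times of India": 255_642,
    "Daily Mirror": 131_674,
    "The Sun": 118_242,
    "New York Times": 94_000,
    "Sunday Times": 84_872,
    "Daily Mail": 62_131,
    "The Express": 34_491,
    "The Sunday Telegraph": 25_340,
    "Metro": 23_970,
    "This Day Lagos": 22_172,
    "South China Morning Post": 19_484,
    "Gulf News": 17_787,
    "The Evening Standard": 13_977,
    "Washington Times": 10_654,
    "Daily News Egypt": 8_486,
}


def share(count, total):
    """Return count as a percentage of total."""
    return 100 * count / total


total = sum(ARTICLES.values())
print(f"{len(ARTICLES)} newspapers, {total:,} articles")

# Assumption: the 1,919 articles are measured against the table total.
sparse_share = share(1_919, total)
print(f"2011, 2012 and 2014: {sparse_share:.3f}% of all articles")

# The three largest titles and their share of the corpus.
largest = sorted(ARTICLES.items(), key=lambda item: item[1], reverse=True)[:3]
for name, count in largest:
    print(f"{name}: {share(count, total):.1f}%")
```

Under that assumption, the table total is consistent with the "2 million articles from 18 newspapers" stated in the introduction, and the sparse years fall below the 0.1% threshold.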

Tools to work with the SiBol corpus

A complete set of Sketch Engine tools is available to work with this English corpus of broadsheet newspapers to generate:

word sketch – English collocations categorized by grammatical relations
thesaurus – synonyms and similar words for every word
keywords – terminology extraction of one-word and multi-word units
word lists – lists of English nouns, verbs, adjectives, etc. organized by frequency
n-grams – frequency list of multi-word units
concordance – examples in context
trends – diachronic analysis automatically identifies neologisms and changes in use
text type analysis – statistics of metadata in the corpus

Changelog

version 2021 (November 2022)

data added – 0.5 million articles (200 million words) from 2021
processed with the recent English TreeTagger pipeline, version 3.1
new term grammar version 3.1

version 2.1 (10 July 2017)

data added – 768,687 articles from 13 newspapers, 9 of which are new to the corpus
the 9 new newspapers are: Daily Mirror, Daily Mail, The New York Times, Washington Post, This Day Lagos, Times of India, Gulf News, Daily News Egypt and South China Morning Post
corpus updated using the new English processing pipeline; the format of the corpus is now compatible with current user corpora

version 1.1 (1 Dec 2011)

recompiled, installed at the production server

version 1.1 (9 Nov 2011)

changed deduplication settings to “-n 7 -m” – 385 million tokens in 787,000 newspaper articles
set name to “SiBol/Port” to better reflect the data collections included

version 1 (31 October 2011)

initial version – 332 million tokens in 643,000 newspaper articles

Search the SiBol corpus

Sketch Engine offers a range of tools to work with this English broadsheet newspapers corpus.

open in Sketch Engine

about Sketch Engine

English Trends corpus

Explore our largest English corpus, which totals over 80 billion words and grows automatically every week.


Use Sketch Engine in minutes

Generate collocations, frequency lists, examples in context and n-grams, or extract terms, with Sketch Engine. Use our Quick Start Guide to learn it in minutes.

Quick Start Guide
