COMPAS: English corpus from immigration news

COMPAS: Corpus of the news articles related to immigration

The COMPAS Corpus is an English corpus made up of daily newspaper articles about immigration. The original collection comprised 132,242 articles about immigrants, migrants, asylum seekers, and refugees that appeared in the UK’s national newspapers from 2006 to 2013.

The corpus was extended in 2016 with texts from the periods 1985–2005 and 2014–2015. This version consists of 260 million words from 354,661 articles.

Availability

Access to the corpus is granted on request. Please contact Dr. William L Allen (Centre on Migration, Policy, and Society at the University of Oxford) at william.allen@politics.ox.ac.uk, who can grant you access to this corpus. Then forward his reply to our support email, support@sketchengine.eu, together with your Sketch Engine username so that we can set up the access for you.

COMPAS corpus in detail

The UK national press can be divided into three main categories: tabloids, midmarkets, and broadsheets. The corpus includes the following newspapers: Daily Mail, Daily Mirror, Daily Star, Daily Star Sunday, Financial Times, Mail on Sunday, Sunday Express, Sunday Mirror, The Daily Telegraph, The Express, The Guardian, The Independent, The Independent on Sunday, The Observer, The People, The Sun, The Sunday Telegraph, The Sunday Times, The Times.

Metainformation

The documents in the corpus contain the following meta fields:

date – in the form yyyy-mm-dd
publication – name of the publication from which the text is taken
title – title of the article
month – the month in which the article was published
language – English (this is the case for all the articles)
year – the year in which the article was published
quarter – the quarter of the year in which the article was published, represented by q1, q2, q3, and q4
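
The month, year, and quarter fields can in principle be derived from the date field. The following is a minimal Python sketch of that relationship, assuming calendar quarters (January to March is q1, and so on). The function name derive_date_fields is hypothetical, and the sketch represents the month as a number, which may differ from how the corpus stores it.

```python
from datetime import date


def derive_date_fields(iso_date: str) -> dict:
    """Derive year, month, and quarter values from a yyyy-mm-dd string.

    Hypothetical helper for illustration; assumes calendar quarters.
    """
    d = date.fromisoformat(iso_date)  # raises ValueError on malformed input
    quarter = (d.month - 1) // 3 + 1  # months 1-3 -> 1, 4-6 -> 2, and so on
    return {"year": d.year, "month": d.month, "quarter": f"q{quarter}"}


print(derive_date_fields("2010-05-17"))
```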

Part-of-speech tagset

The COMPAS corpus was lemmatized and PoS-tagged by TreeTagger using the English Penn Treebank tagset.
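
As an illustration of what lemmatized and PoS-tagged text looks like, the Python sketch below reads tab-separated lines of word, tag, and lemma, which is the column layout TreeTagger produces when token and lemma output are enabled. The sample lines are invented for this example and are not taken from the corpus. The names Token and parse_tagged_line are hypothetical, and how Sketch Engine stores the annotation internally is not described here.

```python
from typing import NamedTuple


class Token(NamedTuple):
    word: str
    tag: str
    lemma: str


def parse_tagged_line(line: str) -> Token:
    """Split one 'word<TAB>tag<TAB>lemma' line into a Token."""
    word, tag, lemma = line.rstrip("\n").split("\t")
    return Token(word, tag, lemma)


# Invented sample lines, not taken from the corpus.
sample = "The\tDT\tthe\nnew\tJJ\tnew\nrefugees\tNNS\trefugee"
tokens = [parse_tagged_line(line) for line in sample.splitlines()]

# Penn Treebank noun tags all start with "NN" (NN, NNS, NNP, NNPS).
nouns = [t.lemma for t in tokens if t.tag.startswith("NN")]
print(nouns)
```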

Tools to work with the English corpus

A complete set of tools is available for working with the COMPAS corpus. They generate:

word sketch – English collocations categorized by grammatical relations
thesaurus – synonyms and similar words for every word
keywords – terminology extraction of one-word and multi-word units
word lists – lists of English nouns, verbs, adjectives etc. organized by frequency
n-grams – frequency list of multi-word units
concordance – examples in context
trends – diachronic analysis that automatically identifies neologisms and changes in use
text type analysis – statistics of metadata in the corpus
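
To make the idea of an n-gram frequency list concrete, here is a minimal Python sketch that counts contiguous word sequences in a token list. It is a generic illustration and not Sketch Engine's implementation. The function name ngram_frequencies and the sample sentence are invented for this example.

```python
from collections import Counter


def ngram_frequencies(tokens, n):
    """Count every contiguous sequence of n tokens."""
    # zip over n shifted copies of the list yields each n-token window.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


# Invented sample text, not taken from the corpus.
tokens = "asylum seekers and refugees and asylum seekers".split()
bigrams = ngram_frequencies(tokens, 2)
print(bigrams.most_common(1))
```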

Changelog

COMPAS 2016

corpus extended to 260 million words
trends computed for diachronic analysis

COMPAS 2015

initial version of the corpus from early 2014 with 100 million words
