COMPAS: Corpus of news articles related to immigration

The COMPAS corpus is an English corpus of daily newspaper articles about immigration. The initial version contained 132,242 articles about immigrants, migrants, asylum seekers, and refugees that appeared in the UK’s national newspapers from 2006 to 2013.

The corpus was extended in 2016 with texts from the periods 1985–2005 and 2014–2015. The extended version consists of 260 million words in 354,661 articles.

COMPAS corpus in detail

The UK national press can be divided into three main categories: tabloids, mid-markets, and broadsheets. The corpus includes the following newspapers: Daily Mail, Daily Mirror, Daily Star, Daily Star Sunday, Financial Times, Mail on Sunday, Sunday Express, Sunday Mirror, The Daily Telegraph, The Express, The Guardian, The Independent, The Independent on Sunday, The Observer, The People, The Sun, The Sunday Telegraph, The Sunday Times, The Times.


The documents in the corpus contain the following meta fields:

  • date – publication date in the form yyyy-mm-dd
  • publication – name of the publication the text is taken from
  • title – title of the article
  • month – month in which the article was published
  • language – English (this is the case for all articles)
  • year – year in which the article was published
  • quarter – quarter of the year in which the article was published, represented by q1, q2, q3, and q4
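
The month, year, and quarter fields can all be derived from the date field. The following is a minimal sketch of that relationship, not the procedure used to build the corpus. The function name derive_date_fields is hypothetical, and storing month and year as integers is an assumption, since the page does not state how those two fields are encoded; only the yyyy-mm-dd date form and the q1 to q4 labels come from the description above.

```python
from datetime import date


def derive_date_fields(iso_date: str) -> dict:
    """Derive year, month and quarter from a yyyy-mm-dd date string.

    Hypothetical helper: the integer encoding of month and year is an
    assumption; only the q1..q4 quarter labels are documented.
    """
    d = date.fromisoformat(iso_date)
    # Months 1-3 map to quarter 1, 4-6 to quarter 2, and so on.
    quarter = (d.month - 1) // 3 + 1
    return {
        "date": iso_date,
        "year": d.year,
        "month": d.month,
        "quarter": f"q{quarter}",
    }


if __name__ == "__main__":
    print(derive_date_fields("2010-05-17"))
```

Fields like these are what make it possible to restrict a search to a period, for example all articles from one quarter of one year.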

Part-of-speech tagset

The COMPAS corpus was lemmatized and PoS-tagged by TreeTagger using the English Penn Treebank tagset.
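
To illustrate what lemmatized and tagged text looks like, here is a small sketch that reads tagger output in the common one-token-per-line layout (word, tag, lemma separated by tabs) and counts noun lemmas. The sample lines are invented for illustration and are not taken from the corpus, and the names Token, parse_vertical, and noun_lemma_counts are hypothetical. It assumes only that noun tags in the Penn Treebank tagset begin with NN.

```python
from collections import Counter
from typing import NamedTuple


class Token(NamedTuple):
    word: str
    tag: str
    lemma: str


def parse_vertical(text: str) -> list[Token]:
    """Parse tab-separated word/tag/lemma lines, skipping malformed ones."""
    tokens = []
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) == 3:
            tokens.append(Token(*parts))
    return tokens


def noun_lemma_counts(tokens: list[Token]) -> Counter:
    """Count lemmas of tokens whose tag starts with NN (Penn noun tags)."""
    return Counter(t.lemma for t in tokens if t.tag.startswith("NN"))


# Invented sample, not corpus data.
SAMPLE = "Refugees\tNNS\trefugee\nin\tIN\tin\nthe\tDT\tthe\ncamps\tNNS\tcamp\n"

if __name__ == "__main__":
    print(noun_lemma_counts(parse_vertical(SAMPLE)))
```

Counting by lemma rather than by word form groups "refugee" and "refugees" together, which is why lemmatization matters for frequency-based tools.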

Tools to work with the English corpus

A complete set of tools is available to work with this English corpus to generate:

  • word sketch – English collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • word lists – lists of English nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
  • trends – diachronic analysis automatically identifies neologisms and changes in use
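
As a rough illustration of what an n-gram frequency list is, the sketch below counts contiguous word sequences in a tokenized text. It is a simplified stand-in written for this page, not the implementation behind the n-grams tool; the function name ngram_frequencies and the sample sentence are hypothetical.

```python
from collections import Counter


def ngram_frequencies(tokens: list[str], n: int) -> Counter:
    """Count every contiguous sequence of n tokens."""
    # zip over n shifted copies of the token list yields sliding windows.
    windows = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(window) for window in windows)


if __name__ == "__main__":
    # Invented sample text, not corpus data.
    sample = "asylum seekers and refugees and asylum seekers".split()
    for gram, freq in ngram_frequencies(sample, 2).most_common(3):
        print(gram, freq)
```

On a real corpus the same idea is applied to millions of tokens, and the most frequent sequences surface recurring multi-word units such as "asylum seekers".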


  • corpus extended to 260 million words
  • trends computed for diachronic analysis


  • initial version of the corpus from early 2014 with 100 million words
