COMPAS: Corpus of news articles related to immigration
The COMPAS Corpus is an English corpus made up of daily newspaper articles about immigration. The original version contains 132,242 articles about immigrants, migrants, asylum seekers, and refugees that appeared in the UK’s national newspapers from 2006 to 2013.
The corpus was extended in 2016 with texts from the periods 1985–2005 and 2014–2015. This version consists of 260 million words in 354,661 articles.
COMPAS corpus in detail
The UK national press can be divided into three main categories: tabloids, midmarkets, and broadsheets. The newspapers included are: Daily Mail, Daily Mirror, Daily Star, Daily Star Sunday, Financial Times, Mail on Sunday, Sunday Express, Sunday Mirror, The Daily Telegraph, The Express, The Guardian, The Independent, The Independent on Sunday, The Observer, The People, The Sun, The Sunday Telegraph, The Sunday Times, The Times.
The documents in the corpus contain the following metadata fields:
- date – In the form of yyyy-mm-dd
- publication – Name of the publication from where the text is taken
- title – Title of the article
- month – Month in which the article was published
- language – English (this is the case for all the articles)
- year – Year in which the article was published
- quarter – Quarter of the year in which the article was published, represented as q1, q2, q3, or q4
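The month, year, and quarter fields can all be derived from the date field. The following is a minimal sketch of that relationship, not the procedure used to build the corpus. The function name derive_date_fields is hypothetical, and representing the month as an integer is an assumption, since the exact format of the month field is not stated here.

```python
from datetime import date


def derive_date_fields(iso_date: str) -> dict:
    """Derive year, month, and quarter from a yyyy-mm-dd date string.

    Hypothetical helper for illustration only. The integer month is an
    assumption; the q1..q4 labels follow the field description above.
    """
    d = date.fromisoformat(iso_date)
    # Months 1-3 map to q1, 4-6 to q2, 7-9 to q3, 10-12 to q4.
    quarter = (d.month - 1) // 3 + 1
    return {"year": d.year, "month": d.month, "quarter": f"q{quarter}"}


print(derive_date_fields("2010-05-17"))
# prints {'year': 2010, 'month': 5, 'quarter': 'q2'}
```

Deriving the three fields from one date avoids inconsistencies when filtering a subcorpus by period.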
The COMPAS corpus was lemmatized and PoS-tagged with TreeTagger using the English Penn Treebank tagset.
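As an illustration of what lemmatized and tagged text looks like, the sketch below reads lines in the tab-separated word, tag, lemma layout that TreeTagger typically prints. It assumes that layout; the format in which COMPAS itself is distributed is not described here. The names Token and parse_tagged_lines and the two sample lines are hypothetical.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    word: str   # surface form as it appears in the article
    tag: str    # part-of-speech tag (Penn Treebank style)
    lemma: str  # base form of the word


def parse_tagged_lines(text: str) -> List[Token]:
    """Parse tab-separated word/tag/lemma lines into Token tuples.

    Assumes one token per line with exactly three tab-separated columns;
    lines with any other shape are skipped.
    """
    tokens = []
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue
        tokens.append(Token(*parts))
    return tokens


# Hypothetical sample, not taken from the corpus.
sample = "The\tDT\tthe\nrefugees\tNNS\trefugee\n"
tokens = parse_tagged_lines(sample)
print([t.lemma for t in tokens])
# prints ['the', 'refugee']
```

Working with lemmas rather than surface forms lets a single query match both "refugee" and "refugees".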