ParlaTalk: automatically updating corpora of parliamentary debates

The ParlaTalk corpora are a set of 22 corpora comprising almost 3 billion words of parliamentary debate transcriptions in 20 languages. The texts were gathered from the parliamentary websites of 22 member states of the European Union. (The remaining member states do not provide documents in a format suitable for automatic processing.) The ParlaTalk corpora are monitor corpora: they are updated automatically once a month and together grow by about 200 million words a year.

The ParlaTalk corpora contain metadata (also called text types) such as the meeting date or the speaker’s name. The text types use a unified format across all corpora. Some corpora include additional text types, e.g. the transcriber’s notes or the speaker’s party affiliation.

Each ParlaTalk corpus covers a different period, depending on the data published by the respective parliament. Usually, this means the last 5 years are included, but some corpora also include earlier years. The most recent documents may be missing if:

  • The chamber published the document but marked it as non-final. In this case, it will be downloaded when the final version is published.
  • The chamber publishes documents in batches. This can delay publication by up to a year.
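
The two cases above amount to a simple selection rule applied at each monthly update. The following is a minimal sketch of that rule, not the actual ParlaTalk pipeline: the `Document` record, its `is_final` flag, and the `select_for_update` function are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Document:
    """Hypothetical record for one published transcript."""
    doc_id: str
    meeting_date: date
    is_final: bool  # False while the chamber marks the text as non-final


def select_for_update(published, already_downloaded):
    """Return the documents to add in this update run.

    Non-final documents are skipped now and picked up by a later run,
    once the chamber has published the final version.
    """
    return [
        doc for doc in published
        if doc.is_final and doc.doc_id not in already_downloaded
    ]


published = [
    Document("a1", date(2025, 5, 6), is_final=True),
    Document("a2", date(2025, 5, 20), is_final=False),  # still a draft
    Document("a3", date(2025, 4, 1), is_final=True),    # already in the corpus
]
new_docs = select_for_update(published, already_downloaded={"a3"})
print([doc.doc_id for doc in new_docs])
```

Documents that a chamber releases late in a batch need no special handling under this rule: they simply appear in `published` during a later run.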

Part-of-speech tagset, lemmatization

All corpora are part-of-speech tagged, i.e. each token is annotated with its part of speech and grammatical categories, and lemmatized, i.e. each word form in the corpus is assigned its base form (lemma). The part-of-speech tagset used by a particular corpus can be checked in the Sketch Engine interface.
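
To make this concrete, each token in a tagged and lemmatized corpus carries three values: the word form, a part-of-speech tag, and a lemma. The sketch below parses a tab-separated layout with one token per line. The sample tokens and tag labels are invented for illustration and do not correspond to the tagset of any particular ParlaTalk corpus.

```python
from typing import NamedTuple


class Token(NamedTuple):
    word: str   # word form as it appears in the transcript
    tag: str    # part-of-speech tag (labels here are illustrative only)
    lemma: str  # base form assigned by lemmatization


# Invented sample: word, tag and lemma separated by tabs.
SAMPLE = """\
The\tDET\tthe
members\tNOUN\tmember
were\tVERB\tbe
voting\tVERB\tvote
"""


def parse_tokens(text):
    """Parse one token per line into Token records."""
    tokens = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        word, tag, lemma = line.split("\t")
        tokens.append(Token(word, tag, lemma))
    return tokens


tokens = parse_tokens(SAMPLE)
# Searching by lemma finds every inflected form of a word.
forms_of_be = [t.word for t in tokens if t.lemma == "be"]
print(forms_of_be)
```

Because the lemma is stored alongside the word form, a single query for a lemma can retrieve all its inflected forms without listing them one by one.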

ParlaTalk corpora – corpus sizes

The total size of the ParlaTalk corpora was 2.8 billion words as of June 2025. The table below shows the size of each national parliament’s corpus.

ParlaTalk corpus                               Number of words
ParlaTalk Austria – parliamentary debates      14 million
ParlaTalk Belgium – parliamentary debates      60 million
ParlaTalk Bulgaria – parliamentary debates     8 million
ParlaTalk Czechia – parliamentary debates      24 million
ParlaTalk Denmark – parliamentary debates      90 million
ParlaTalk Estonia – parliamentary debates      11 million
ParlaTalk Finland – parliamentary debates      26 million
ParlaTalk France – parliamentary debates       104 million
ParlaTalk Germany – parliamentary debates      286 million
ParlaTalk Greece – parliamentary debates       76 million
ParlaTalk Hungary – parliamentary debates      55 million
ParlaTalk Ireland – parliamentary debates      43 million
ParlaTalk Italy – parliamentary debates        106 million
ParlaTalk Latvia – parliamentary debates       1001 million
ParlaTalk Netherlands – parliamentary debates  105 million
ParlaTalk Poland – parliamentary debates       20 million
ParlaTalk Portugal – parliamentary debates     147 million
ParlaTalk Romania – parliamentary debates      45 million
ParlaTalk Slovakia – parliamentary debates     12 million
ParlaTalk Slovenia – parliamentary debates     87 million
ParlaTalk Spain – parliamentary debates        443 million
ParlaTalk Sweden – parliamentary debates       135 million
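
For readers who want to work with these figures programmatically, here is a minimal sketch that parses table rows into country and word-count pairs. It uses three rows copied from the table above. The `parse_row` function and the regular expression are illustrative helpers, not part of any Sketch Engine tooling.

```python
import re

# Matches rows such as "ParlaTalk Austria – parliamentary debates 14 million".
ROW = re.compile(
    r"^ParlaTalk (?P<country>.+?) – parliamentary debates\s+(?P<millions>\d+) million$"
)


def parse_row(line):
    """Return (country, number of words) for one table row."""
    match = ROW.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized row: {line!r}")
    return match["country"], int(match["millions"]) * 1_000_000


rows = [
    "ParlaTalk Austria – parliamentary debates 14 million",
    "ParlaTalk Germany – parliamentary debates 286 million",
    "ParlaTalk Spain – parliamentary debates 443 million",
]
sizes = dict(parse_row(row) for row in rows)
largest = max(sizes, key=sizes.get)
print(largest, sizes[largest])
```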

Tools to work with the ParlaTalk corpora

A complete set of Sketch Engine tools is available to work with these corpora of parliamentary debates to generate:

  • word sketch – collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • keywords – terminology extraction of one-word and multi-word units
  • word lists – lists of nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
  • trends – diachronic analysis automatically identifies neologisms and changes in use
  • text type analysis – statistics of metadata in the corpus
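
To illustrate what three of these outputs look like, the sketch below builds a word list, an n-gram frequency list, and a small concordance from a toy sentence in plain Python. It only demonstrates the concepts on an invented example and makes no claim about how Sketch Engine computes them. The `ngrams` and `concordance` helpers are written here for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return an iterator over every run of n consecutive tokens."""
    return zip(*(tokens[i:] for i in range(n)))


def concordance(tokens, keyword, width=2):
    """Return keyword-in-context lines with `width` tokens on each side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


text = "the committee approved the budget and the committee adjourned"
tokens = text.split()

word_list = Counter(tokens).most_common(2)           # most frequent words
bigrams = Counter(ngrams(tokens, 2)).most_common(1)  # most frequent 2-gram
kwic = concordance(tokens, "budget")                 # examples in context

print(word_list)
print(bigrams)
print(kwic)
```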
