ParlaTalk: automatically updating corpora of parliamentary debates
The ParlaTalk corpora are a set of 22 corpora comprising almost 3 billion words of transcribed parliamentary debates in 20 languages. The texts were gathered from the parliamentary websites of 22 member states of the European Union. (The remaining member states do not publish documents in a format suitable for automatic processing.) The ParlaTalk corpora are monitor corpora: they are updated automatically once a month and together grow by about 200 million words every year.
The ParlaTalk corpora contain metadata (also called text types) such as the meeting date or the speaker's name. The text types use a unified format across all corpora. Some corpora include additional text types, e.g. the transcriber's notes or the speaker's party affiliation.
Each ParlaTalk corpus covers a different period, depending on the data published by the specific parliament. Usually, the last 5 years are included, and sometimes earlier years as well. The most recent documents may not be present if:
- The chamber published the document but marked it as non-final. In this case, it will be downloaded when the final version is published.
- The chamber publishes the documents in batches. This can delay their availability by up to a year.
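The first rule above can be pictured as a filter applied at each monthly update. The following is a minimal sketch under stated assumptions, not the actual ParlaTalk pipeline: the `Transcript` class, its `is_final` flag and the `select_for_update` function are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Transcript:
    """Hypothetical record for one published debate transcript."""
    doc_id: str
    meeting_date: date
    is_final: bool  # assumed flag: False while the chamber marks the text as non-final


def select_for_update(published, already_in_corpus):
    """Return the transcripts a monthly update would add.

    Non-final transcripts are skipped and stay eligible for a later
    update, once the chamber publishes the final version.
    """
    return [
        t for t in published
        if t.is_final and t.doc_id not in already_in_corpus
    ]


published = [
    Transcript("a-001", date(2025, 5, 6), is_final=True),
    Transcript("a-002", date(2025, 5, 20), is_final=False),
    Transcript("a-003", date(2025, 6, 3), is_final=True),
]
new_docs = select_for_update(published, already_in_corpus={"a-001"})
print([t.doc_id for t in new_docs])
```

In this sketch, `a-002` is left out for now and would be picked up by a later run in which its `is_final` flag is true.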
Part-of-speech tagset, lemmatization
All corpora are part-of-speech tagged, meaning each token is labelled with its part of speech and grammatical category, and lemmatized, meaning each word form in the corpus is assigned its base form (lemma). The particular part-of-speech tagset can be checked within the Sketch Engine interface.
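To make the two annotation layers concrete, here is a small illustrative sketch. The `Token` type, the sample tokens and the simplified tag labels are assumptions made for this example; they are not the tagset or data model of any ParlaTalk corpus.

```python
from collections import defaultdict
from typing import NamedTuple


class Token(NamedTuple):
    word: str   # word form as it appears in the text
    tag: str    # part-of-speech label (illustrative labels, not a real tagset)
    lemma: str  # base form assigned by lemmatization


# Hand-annotated sample tokens, for illustration only.
TOKENS = [
    Token("votes", "NOUN", "vote"),
    Token("voted", "VERB", "vote"),
    Token("voting", "VERB", "vote"),
    Token("members", "NOUN", "member"),
]


def forms_by_lemma(tokens):
    """Group the word forms found in the text under their lemma."""
    grouped = defaultdict(set)
    for token in tokens:
        grouped[token.lemma].add(token.word.lower())
    return dict(grouped)


print(forms_by_lemma(TOKENS)["vote"])
```

Grouping by lemma is what lets a single search for a base form match all of its inflected forms, while the tag makes it possible to keep the noun and verb uses apart.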
ParlaTalk corpora – corpus sizes
The total size of the ParlaTalk corpora is 2.8 billion words as of June 2025. The table below shows the size of the corpus for each national parliament.
ParlaTalk corpus | Number of words |
ParlaTalk Austria – parliamentary debates | 14 million |
ParlaTalk Belgium – parliamentary debates | 60 million |
ParlaTalk Bulgaria – parliamentary debates | 8 million |
ParlaTalk Czechia – parliamentary debates | 24 million |
ParlaTalk Denmark – parliamentary debates | 90 million |
ParlaTalk Estonia – parliamentary debates | 11 million |
ParlaTalk Finland – parliamentary debates | 26 million |
ParlaTalk France – parliamentary debates | 104 million |
ParlaTalk Germany – parliamentary debates | 286 million |
ParlaTalk Greece – parliamentary debates | 76 million |
ParlaTalk Hungary – parliamentary debates | 55 million |
ParlaTalk Ireland – parliamentary debates | 43 million |
ParlaTalk Italy – parliamentary debates | 106 million |
ParlaTalk Latvia – parliamentary debates | 1001 million |
ParlaTalk Netherlands – parliamentary debates | 105 million |
ParlaTalk Poland – parliamentary debates | 20 million |
ParlaTalk Portugal – parliamentary debates | 147 million |
ParlaTalk Romania – parliamentary debates | 45 million |
ParlaTalk Slovakia – parliamentary debates | 12 million |
ParlaTalk Slovenia – parliamentary debates | 87 million |
ParlaTalk Spain – parliamentary debates | 443 million |
ParlaTalk Sweden – parliamentary debates | 135 million |
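For readers who want to work with these figures programmatically, the sketch below parses rows in the pipe-delimited layout used above into word counts. It is only an illustration: `parse_sizes` is a hypothetical helper, and the sample holds just the first three rows of the table.

```python
SAMPLE_ROWS = """\
ParlaTalk corpus | Number of words |
ParlaTalk Austria – parliamentary debates | 14 million |
ParlaTalk Belgium – parliamentary debates | 60 million |
ParlaTalk Bulgaria – parliamentary debates | 8 million |
"""


def parse_sizes(text):
    """Map each country name to its corpus size in words."""
    sizes = {}
    for line in text.splitlines():
        cells = [cell.strip() for cell in line.split("|")]
        # Skip the header row and anything that is not "<number> million".
        if len(cells) < 2 or not cells[1].endswith("million"):
            continue
        country = cells[0].split(" – ")[0].replace("ParlaTalk ", "", 1)
        sizes[country] = int(cells[1].split()[0]) * 1_000_000
    return sizes


sizes = parse_sizes(SAMPLE_ROWS)
for country, words in sorted(sizes.items(), key=lambda item: -item[1]):
    print(f"{country}: {words:,} words")
```

Because the published figures are rounded to the nearest million, any total computed this way is approximate.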
Tools to work with the ParlaTalk corpora
The complete set of Sketch Engine tools is available for working with these corpora of parliamentary debates:
- word sketch – collocations categorized by grammatical relations
- thesaurus – synonyms and similar words for every word
- keywords – terminology extraction of one-word and multi-word units
- word lists – lists of nouns, verbs, adjectives etc. organized by frequency
- n-grams – frequency list of multi-word units
- concordance – examples in context
- trends – diachronic analysis that automatically identifies neologisms and changes in use
- text type analysis – statistics of metadata in the corpus
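As a rough illustration of what an n-gram frequency list contains, the sketch below counts multi-word units in a tokenized sentence. It is a generic Python example with a hypothetical `ngram_frequencies` helper and an invented sample sentence; it does not reproduce how Sketch Engine computes n-grams.

```python
from collections import Counter


def ngram_frequencies(tokens, n):
    """Count every run of n consecutive tokens."""
    # zip over n shifted copies of the token list yields each n-gram once.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


tokens = "the honourable member thanks the honourable member".split()
bigrams = ngram_frequencies(tokens, 2)
for gram, count in bigrams.most_common():
    print(count, gram)
```

The same function with `n=3` gives trigrams; on a real corpus the list would normally be restricted to the most frequent items.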