The Europarl parallel corpus
The Europarl corpus is a parallel corpus compiled from the proceedings of the European Parliament in the official languages of the EU. It includes 21 European languages: Romance (French, Italian, Spanish, Portuguese, Romanian), Germanic (English, Dutch, German, Danish, Swedish), Slavic (Bulgarian, Czech, Polish, Slovak, Slovene), Finno-Ugric (Finnish, Hungarian, Estonian), Baltic (Latvian, Lithuanian), and Greek. The corpus was expanded repeatedly, reaching a final size of around 60 million words per language. The texts cover the period January 2007 – November 2011.
Most languages of the Europarl corpus were processed with the TreeTagger tool, so lemmas and part-of-speech tags are available in the corpora for those languages.
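As an illustration of what lemma and part-of-speech annotation looks like in practice, the following is a minimal sketch that reads TreeTagger-style output, assuming the common three-column layout of one token per line with word, tag, and lemma separated by tabs. The `Token` type, the `parse_tagged_lines` function, and the sample lines are made up for this example and are not taken from the corpus itself.

```python
from typing import Iterable, Iterator, NamedTuple


class Token(NamedTuple):
    word: str
    pos: str
    lemma: str


def parse_tagged_lines(lines: Iterable[str]) -> Iterator[Token]:
    """Yield one Token per 'word<TAB>tag<TAB>lemma' line; skip anything else."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 3:  # lines such as structural markup are ignored
            yield Token(*fields)


# Hand-made sample in the assumed format (not real corpus data).
sample = [
    "The\tDT\tthe\n",
    "proceedings\tNNS\tproceeding\n",
    "were\tVBD\tbe\n",
    "published\tVVN\tpublish\n",
    "<s>\n",
]

tokens = list(parse_tagged_lines(sample))
lemmas = [t.lemma for t in tokens]
print(lemmas)
```

The same function can be applied to an open file handle instead of a list, since it only needs an iterable of lines.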
Corpus data and more information can be found on the official website: http://www.statmt.org/europarl/
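Once the data has been downloaded, a parallel corpus is typically consumed as pairs of aligned sentences. The sketch below assumes a sentence-aligned layout in which line N of one language file corresponds to line N of the other; the `aligned_pairs` function and the sample sentences are hypothetical and serve only to show the idea.

```python
from typing import Iterable, Iterator, Tuple


def aligned_pairs(
    source_lines: Iterable[str], target_lines: Iterable[str]
) -> Iterator[Tuple[str, str]]:
    """Pair up line N of the source with line N of the target.

    Pairs in which either side is empty are dropped, since they carry
    no usable translation.
    """
    for src, tgt in zip(source_lines, target_lines):
        src, tgt = src.strip(), tgt.strip()
        if src and tgt:
            yield src, tgt


# Hand-made sample lines standing in for two aligned files (not real corpus data).
english = ["Resumption of the session\n", "\n", "Thank you.\n"]
french = ["Reprise de la session\n", "\n", "Merci.\n"]

pairs = list(aligned_pairs(english, french))
print(pairs)
```

With real files, the two lists would be replaced by two file objects opened with `encoding="utf-8"`; the file names depend on the language pair that was downloaded.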