This corpus is derived from the original JRC-Acquis corpus, which contains legislative documents of the European Parliament since 1958. It comprises a subset of the original corpus, processed into aligned form (Moses/Giza++): the body paragraphs of each document are aligned with that document's EuroVoc classes. The data covers seven languages (cs, de, en, es, fr, it, sv), and the files contain training and test sets.
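As a rough illustration of how a line-aligned corpus of this kind might be consumed, the sketch below pairs each paragraph with its class labels. It assumes, without confirmation from the corpus itself, that one file holds one paragraph per line and a second file holds the whitespace-separated EuroVoc class identifiers for the same line number. The function name and the file layout are hypothetical.

```python
from itertools import zip_longest
from pathlib import Path
from typing import Iterator, List, Tuple


def read_aligned(text_path: Path, label_path: Path) -> Iterator[Tuple[str, List[str]]]:
    """Yield (paragraph, labels) pairs from two line-aligned files.

    Assumption: line i of text_path is aligned with line i of label_path,
    and labels on a line are separated by whitespace.
    """
    with open(text_path, encoding="utf-8") as texts, \
            open(label_path, encoding="utf-8") as labels:
        pairs = zip_longest(texts, labels)
        for line_no, (text, label_line) in enumerate(pairs, start=1):
            # A missing partner means the two files are not truly aligned.
            if text is None or label_line is None:
                raise ValueError(f"files differ in length at line {line_no}")
            yield text.rstrip("\n"), label_line.split()
```

Raising on a length mismatch, rather than silently truncating as a plain `zip` would, makes a broken download or a wrong file pairing visible immediately.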
Size: ~95k documents
Test set: 2%
Moses/Giza++ format: 73.3 MB, 2 files
The data server also offers downloads via FTP and via rsync (password m1446653): rsync rsync://m1446653@dataserv.ub.tum.de/m1446653/
Language:
de
Rights:
CC BY 4.0, http://creativecommons.org/licenses/by/4.0
Other rights:
Rights implied by original corpus (JRC-Acquis), Commission Decision of 12 December 2011 on the re-use of Commission documents, published in Official Journal of the European Union L330 of 14 December 2011, pages 39 to 42