This corpus is derived from the original JRC Acquis corpus, which contains legislative documents of the European Parliament dating back to 1958. It is a subset of the original corpus, processed into aligned form (Moses/Giza++). The full texts are taken from the body paragraphs of the JRC documents, and the summaries are the documents' title elements. The data is split across seven languages (cs, de, en, es, fr, it, sv), and the files contain training and test sets.
Size: ~150k full-text/summary pairs
Test set: 1.5%
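
To illustrate what the aligned form means in practice, here is a minimal sketch in Python. It assumes the usual Moses convention of two plain-text files in which line i of one file corresponds to line i of the other; the function names and any file paths passed in are hypothetical, since the description above does not name the actual files. The second helper estimates the test set size, assuming the 1.5% refers to the ~150k pairs.

```python
from itertools import zip_longest


def read_aligned_pairs(fulltext_path, summary_path):
    """Yield (fulltext, summary) pairs from two line-aligned files.

    Assumption: one document per line, with line i of the full-text file
    matching line i of the summary file (the usual Moses layout).
    """
    with open(fulltext_path, encoding="utf-8") as src, \
         open(summary_path, encoding="utf-8") as tgt:
        for n, (full, summ) in enumerate(zip_longest(src, tgt), start=1):
            # zip_longest pads the shorter file with None, which signals
            # that the two files do not have the same number of lines.
            if full is None or summ is None:
                raise ValueError(f"files are not line-aligned at line {n}")
            yield full.rstrip("\n"), summ.rstrip("\n")


def approx_test_size(total_pairs, test_fraction=0.015):
    """Estimate the number of test pairs from the stated fraction."""
    return round(total_pairs * test_fraction)
```

Under these assumptions, `approx_test_size(150_000)` gives roughly 2,250 test pairs. The generator reads the files lazily, so a large split does not have to fit in memory.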