Tags:

Corpora, Data, Machine Translation, Multilingual

CzEng 2.0

(Czech-English Parallel Corpus, version 2.0)

Introduction

CzEng 2.0 is the sixth release of a sentence-parallel Czech-English corpus compiled at the Institute of Formal and Applied Linguistics (ÚFAL). It is freely available for non-commercial research purposes. The main aim of the current release is to filter and enlarge the collection of parallel sentences.

Here we summarize which CzEng versions should be used in which shared tasks:

WMT16 Translation Task and IT Translation Task	a simplified pre-release, CzEng 1.6pre
WMT17 Translation Task	CzEng 1.6
WMT18 Translation Task	a subset of CzEng 1.6 sentences, released under a new label: CzEng 1.7
WMT19 Translation Task	CzEng 1.7
WMT20 Translation Task	a release of CzEng 2.0 available from this page.

Data

CzEng 2.0 is composed of authentic and synthetic parallel data.

The authentic part contains filtered CzEng 1.6 [6] (train+dtest sections) and seven additional resources (Europarl, Paracrawl, Common Crawl, News Commentary, Tilde MODEL, Wiki Titles, and WikiMatrix), which we downloaded from WMT 2020.

The synthetic part contains Czech [1] and English [2] news crawl translated with CUNI-TRANSFORMER systems [3].

If you want a smaller and cleaner corpus, you may consider:
- further filtering (sentence level or document level) based on the provided scores;
- removing noisier sources, e.g. Paracrawl and WikiMatrix (information about the source is encoded in the ID).
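As a rough illustration of both suggestions, the following Python sketch filters lines in the six-column format described under File Format. The thresholds (`min_adq`, `min_lang`) and the way a source is detected (a case-insensitive substring match on the ID) are assumptions made for this example only; inspect real IDs and score distributions before relying on them.

```python
from typing import Iterable, Iterator, Sequence

# Assumption: source names occur as substrings of the ID column.
NOISY_SOURCES = ("paracrawl", "wikimatrix")


def keep_line(line: str,
              min_adq: float = 0.5,      # hypothetical threshold
              min_lang: float = 0.9,     # hypothetical threshold
              noisy_sources: Sequence[str] = NOISY_SOURCES) -> bool:
    """Return True if a tab-separated CzEng 2.0 line passes all filters."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        return False  # malformed line
    pair_id, adq, cs_lang, en_lang, _cs, _en = fields
    if any(source in pair_id.lower() for source in noisy_sources):
        return False
    return (float(adq) >= min_adq
            and float(cs_lang) >= min_lang
            and float(en_lang) >= min_lang)


def filter_lines(lines: Iterable[str]) -> Iterator[str]:
    """Sentence-level filtering; empty lines (document separators) are kept."""
    for line in lines:
        if not line.strip():
            yield line
        elif keep_line(line):
            yield line
```

Document-level filtering could instead aggregate the scores of all sentence pairs in a document and keep or drop the document as a whole.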

File Format

Each file contains the following six tab-separated columns. All three scores lie between 0 and 1, and higher values mean better scores (cleaner sentence pairs). Documents are separated by empty lines. All the data are deduplicated at the document level and shuffled.

ID - unique ID for each sentence pair, the last segment starting with "s" distinguishes sentences within the same document
adq_score - computed as Dual conditional cross-entropy filtering [4]
cs_lang_score - p(lang=Czech)/p(lang=x), where p are the probabilities assigned by FastText [5] to a given sentence and x is the most probable language
en_lang_score - p(lang=English)/p(lang=x)
Czech sentence
English sentence
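A minimal Python sketch of reading this format is shown below. It groups sentence pairs into documents using the empty separator lines. The names `SentencePair` and `read_documents` are illustrative and not part of any CzEng tooling.

```python
from typing import Iterable, Iterator, List, NamedTuple


class SentencePair(NamedTuple):
    pair_id: str
    adq_score: float
    cs_lang_score: float
    en_lang_score: float
    cs: str
    en: str


def read_documents(lines: Iterable[str]) -> Iterator[List[SentencePair]]:
    """Yield one list of sentence pairs per document."""
    doc: List[SentencePair] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            # An empty line closes the current document.
            if doc:
                yield doc
                doc = []
            continue
        # Raises ValueError if a line does not have exactly six columns.
        pair_id, adq, cs_lang, en_lang, cs, en = line.split("\t")
        doc.append(SentencePair(pair_id, float(adq), float(cs_lang),
                                float(en_lang), cs, en))
    if doc:
        yield doc
```

Assuming the downloaded files are gzip-compressed UTF-8 text, the function can be fed from `gzip.open("czeng20-test.gz", "rt", encoding="utf-8")`.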

If you need syntactic annotation and alignment, please use CzEng 1.6: https://ufal.mff.cuni.cz/czeng/czeng16

Filtering

For the synthetic data (csmono and enmono), we set adq_score to 1.0 for all sentences. For the authentic data (train and test), we computed adq_score using conditional cross-entropies (without word-normalization) predicted by the CUNI-TRANSFORMER [3] model:

    adq_score = exp(-(|HA - HB| + (HA + HB)/2))
    HA = -log P(en|cs)
    HB = -log P(cs|en)

After document-level deduplication, we deleted:

sentences longer than 200 (space-separated) words or 1600 characters
sentences with cs_lang_score
sentences with adq_score
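The adq_score formula and the length criterion can be sketched in Python as follows. This is only an illustration: the cross-entropies HA and HB must come from a translation model, which is outside the scope of the sketch, and counting words with `str.split()` (any whitespace) is an assumed approximation of "space-separated words".

```python
import math


def adq_score(h_a: float, h_b: float) -> float:
    """Compute exp(-(|HA - HB| + (HA + HB) / 2)).

    h_a = -log P(en|cs) and h_b = -log P(cs|en) are sentence-level
    cross-entropies, not normalized by the number of words.
    """
    return math.exp(-(abs(h_a - h_b) + (h_a + h_b) / 2))


def too_long(sentence: str, max_words: int = 200, max_chars: int = 1600) -> bool:
    """True if the sentence exceeds 200 words or 1600 characters."""
    return len(sentence.split()) > max_words or len(sentence) > max_chars
```

The score equals 1 only when both cross-entropies are 0. It decreases both when the two translation directions disagree (large |HA - HB|) and when either direction finds the pair unlikely (large HA + HB).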

Citing CzEng 2.0

@article{kocmi2020announcing,
    title={Announcing CzEng 2.0 Parallel Corpus with over 2 Gigawords},
    author={Tom Kocmi and Martin Popel and Ondrej Bojar},
    year={2020},
    journal={arXiv preprint arXiv:2007.03006},
}

Paper: https://arxiv.org/abs/2007.03006
URL: http://ufal.mff.cuni.cz/czeng/

To improve the reproducibility of your results, please indicate which sections you have used for training and/or evaluation.

Register

To download CzEng 2.0, you have to register by filling in the following form. We will send you a unique username to access the files. If you do not hear from us within a week, fill in the form again or contact us directly.

Download

After registration, you will receive a unique username. This username and the shared password "czeng" will be requested at the following links:

Download	Description	Sentence pairs	Czech words	English words
README	Readme file with instructions	-	-	-
czeng20-train.gz [4.4G]	Authentic training set	61M	617M	702M
czeng20-test.gz [0.02G]	Authentic testing set	0.5M	4M	5M
czeng20-csmono.gz [4.4G]	Czech mono with English synthetic	51M	700M	833M
czeng20-enmono.gz [7.7G]	English mono with Czech synthetic	76M	1296M	1474M

Tip for Linux wget tool: Use the flags --user=YOUR-USERNAME --password=czeng to pass the authorization check. Use the flag --continue to continue an interrupted transfer.
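For users who prefer Python over wget, the following sketch builds an equivalent request with the standard library. It assumes that the server accepts HTTP Basic authentication and honors `Range` requests, neither of which is confirmed on this page. The `url` argument stands for the download link of the chosen file, and the function names are illustrative.

```python
import base64
import os
import shutil
import urllib.request


def build_request(url: str, username: str, password: str = "czeng",
                  resume_from: int = 0) -> urllib.request.Request:
    """Build a request with Basic auth and an optional Range header."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    request = urllib.request.Request(url)
    request.add_header("Authorization", "Basic " + token.decode("ascii"))
    if resume_from > 0:
        # Ask only for the bytes that are still missing.
        request.add_header("Range", f"bytes={resume_from}-")
    return request


def download(url: str, username: str, dest: str) -> None:
    """Download url to dest, continuing a partial file when possible."""
    offset = os.path.getsize(dest) if os.path.exists(dest) else 0
    request = build_request(url, username, resume_from=offset)
    with urllib.request.urlopen(request) as response:
        # Status 206 means the server honored the Range header;
        # any other success status means the file is sent from the start.
        mode = "ab" if response.status == 206 else "wb"
        with open(dest, mode) as out:
            shutil.copyfileobj(response, out)
```

If `dest` is already complete, a server that supports ranges will typically answer with an error status, which `urlopen` raises as `urllib.error.HTTPError`.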

Remark for WMT shared task participants (WMT16 and later): There is no intersection between CzEng 1.6 data and WMT dev and evaluation data. However, WMT shared task participants are kindly asked to use only the Training sections of CzEng and to avoid the Test section, so that some held-out data remain for the evaluation of future experiments. In any case, please indicate clearly how much data you used and from which sections of CzEng 2.0.

REFERENCES:

[1] http://data.statmt.org/news-crawl/cs-doc/
[2] http://data.statmt.org/news-crawl/en-doc/
[3] Martin Popel. "CUNI Transformer Neural MT System for WMT18" (2018). https://www.aclweb.org/anthology/W18-6424/
[4] Marcin Junczys-Dowmunt. "Dual conditional cross-entropy filtering of noisy parallel corpora." (2018). https://www.aclweb.org/anthology/W18-6478/
[5] https://fasttext.cc/blog/2017/10/02/blog-post.html
[6] Ondřej Bojar, Ondřej Dušek, Tom Kocmi, Jindřich Libovický, Michal Novák, Martin Popel, Roman Sudarikov, Dušan Variš. "CzEng 1.6: Enlarged Czech-English Parallel Corpus with Processing Tools Dockered." http://link.springer.com/chapter/10.1007/978-3-319-45510-5_27

Acknowledgment

We gratefully acknowledge support from:

Bergamot - Browser-based Multilingual Translation, grant No. 825303
ELITR - European Live Translator, grant No. 825460
GAČR - Mnohojazyčný strojový překlad, grant No. 18-24210S

CzEng 2.0 contains data from previous releases of CzEng that have been supported by:

Horizon 2020 of the EU:
- grant H2020-ICT-2014-1-645452 (QT21: Quality Translation 21);
- grant H2020-ICT-2014-1-644402 "Health in my Language";
the 7th Framework Programme of the EU:
- grant no. 610516 "QTLeap";
the Grant Agency of the Czech Republic:
- grant GA-15-10472S "Manyla (Morphologically and Syntactically Annotated Corpora of Many Languages)";
- grant GA-16-05394S "Structure of coreferential chains in parallel language data";
the Grant Agency of Charles University:
- grant GAUK 338915 "Cross-lingual approaches to coreference resolution";
Specific Academic Research projects of Charles University:
- project SVV 260 333.
