CzEng 2.0

(Czech-English Parallel Corpus, version 2.0)

Introduction

CzEng 2.0 is the sixth release of a sentence-parallel Czech-English corpus compiled at the Institute of Formal and Applied Linguistics (ÚFAL). It is freely available for non-commercial research purposes. The main aim of the current release is to filter and enlarge the collection of parallel sentences.

Here we summarize which CzEng versions should be used in which shared tasks:

WMT16 Translation Task and IT Translation Task: a simplified pre-release, CzEng 1.6pre
WMT17 Translation Task: CzEng 1.6
WMT18 Translation Task: a subset of CzEng 1.6 sentences, released under a new label, CzEng 1.7
WMT19 Translation Task: CzEng 1.7
WMT20 Translation Task: a release of CzEng 2.0, available from this page

Data

CzEng 2.0 is composed of authentic and synthetic parallel data.

The authentic part contains filtered CzEng 1.6 [6] (train+dtest sections) and seven additional resources, which we downloaded from WMT 2020: Europarl, Paracrawl, Common Crawl, News Commentary, Tilde MODEL, Wiki Titles, and WikiMatrix.

The synthetic part contains Czech and English news crawl [1, 2] translated with CUNI-TRANSFORMER systems [3].

If you want a smaller and cleaner corpus, you may consider:

  • further filtering (sentence level or document level) based on the provided scores
  • removing noisier sources, e.g. Paracrawl and WikiMatrix (information about the source is encoded in the ID)
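As a rough illustration of both options, the following Python sketch keeps only sentence pairs whose three scores reach chosen thresholds and whose ID does not mention an excluded source. It reads lines in the six-column format described under File Format. The function name `keep_pair`, the default thresholds, and the idea that a source name appears as a case-insensitive substring of the ID are all assumptions made for illustration; consult the README for the actual ID layout and choose thresholds to suit your task.

```python
def keep_pair(line, min_adq=0.5, min_lang=0.9,
              excluded_sources=("paracrawl", "wikimatrix")):
    """Return True if one tab-separated CzEng 2.0 line passes the filters.

    The thresholds and the source names are hypothetical defaults,
    not values recommended by the corpus authors.
    """
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 6:
        # Empty lines (document separators) and malformed lines are rejected.
        return False
    pair_id, adq, cs_lang, en_lang, _cs, _en = fields

    # ASSUMPTION: the source name occurs somewhere inside the ID string.
    lowered_id = pair_id.lower()
    if any(source in lowered_id for source in excluded_sources):
        return False

    return (float(adq) >= min_adq
            and float(cs_lang) >= min_lang
            and float(en_lang) >= min_lang)
```

Document-level filtering could be built on top of this, for example by dropping a whole document when too many of its sentence pairs are rejected.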

File Format

Each file contains the following six tab-separated columns. All three scores lie between 0 and 1, and higher values mean better scores (cleaner sentence pairs). Documents are separated by empty lines. All the data are deduplicated at the document level and shuffled.

  1. ID - unique ID for each sentence pair, the last segment starting with "s" distinguishes sentences within the same document
  2. adq_score - computed by dual conditional cross-entropy filtering [4]
  3. cs_lang_score - p(lang=Czech)/p(lang=x), where p are the probabilities assigned by FastText [5] to a given sentence and x is the most probable language
  4. en_lang_score - p(lang=English)/p(lang=x)
  5. Czech sentence
  6. English sentence
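The language scores in columns 3 and 4 are ratios, so a sentence whose most probable language is the expected one scores exactly 1. A minimal sketch of that ratio follows; it assumes the language identifier's output has already been collected into a dictionary mapping language codes to probabilities. The function name and the dictionary representation are illustrative and do not reflect the FastText API.

```python
def lang_score(probs, target):
    """Return p(lang=target) / p(lang=x), where x is the most probable language.

    `probs` maps language codes to probabilities (a hypothetical
    representation of a language identifier's output).
    """
    if not probs:
        return 0.0
    top = max(probs.values())
    if top <= 0.0:
        return 0.0
    # A language missing from the dictionary is treated as probability 0.
    return probs.get(target, 0.0) / top
```

For instance, with probabilities of 0.2 for Czech and 0.8 for Slovak, the Czech score is 0.2 / 0.8 = 0.25.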
If you need syntactic annotation and alignment, please use CzEng 1.6: https://ufal.mff.cuni.cz/czeng/czeng16
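The six-column format with empty lines between documents can be read with a few lines of Python. The sketch below is one possible reader, not an official tool: the names `SentencePair`, `parse_line`, `read_documents`, and `read_czeng` are made up for illustration, and UTF-8 encoding of the files is an assumption.

```python
import gzip
from typing import Iterable, Iterator, List, NamedTuple


class SentencePair(NamedTuple):
    pair_id: str
    adq_score: float
    cs_lang_score: float
    en_lang_score: float
    cs: str
    en: str


def parse_line(line: str) -> SentencePair:
    """Split one non-empty line into its six tab-separated columns."""
    pair_id, adq, cs_lang, en_lang, cs, en = line.rstrip("\r\n").split("\t")
    return SentencePair(pair_id, float(adq), float(cs_lang), float(en_lang), cs, en)


def read_documents(lines: Iterable[str]) -> Iterator[List[SentencePair]]:
    """Group sentence pairs into documents, which are separated by empty lines."""
    document: List[SentencePair] = []
    for line in lines:
        if line.strip() == "":
            if document:
                yield document
                document = []
        else:
            document.append(parse_line(line))
    if document:
        # The last document may not be followed by an empty line.
        yield document


def read_czeng(path: str) -> Iterator[List[SentencePair]]:
    """Stream documents from a gzipped file (ASSUMPTION: UTF-8 text)."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        yield from read_documents(handle)
```

Because the readers are generators, a call such as `read_czeng("czeng20-test.gz")` streams one document at a time and never loads a multi-gigabyte file into memory.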

Filtering

 

For the synthetic data (csmono and enmono), we set adq_score to 1.0 for all sentences. For the authentic data (train and test), we computed adq_score using conditional cross-entropies (without word-normalization) predicted by the CUNI-TRANSFORMER [3] model:

    adq_score = exp(-(|H_A - H_B| + (H_A + H_B)/2))
    H_A = -log P(en|cs)
    H_B = -log P(cs|en)

After document-level deduplication, we deleted:

  • sentences longer than 200 (space-separated) words or 1600 characters
  • sentences with cs_lang_score
  • sentences with adq_score
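To make the adq_score formula and the length limit concrete, here is a small Python sketch. It assumes the two conditional cross-entropies, -log P(en|cs) and -log P(cs|en), have already been obtained from a translation model. The function names are illustrative, and the code is not the pipeline actually used to build the corpus.

```python
import math


def adq_score(h_a: float, h_b: float) -> float:
    """Dual conditional cross-entropy score from two cross-entropies.

    h_a = -log P(en|cs), h_b = -log P(cs|en), both non-negative and
    not normalized by sentence length.
    """
    # The first term penalizes disagreement between the two directions,
    # the second penalizes a high average cross-entropy.
    return math.exp(-(abs(h_a - h_b) + (h_a + h_b) / 2))


def within_length_limits(sentence: str, max_words: int = 200,
                         max_chars: int = 1600) -> bool:
    """Check the length limits: at most 200 words and 1600 characters.

    Words are counted by splitting on whitespace, a slight generalization
    of "space-separated".
    """
    return len(sentence.split()) <= max_words and len(sentence) <= max_chars
```

Since both cross-entropies are non-negative, the score always falls in the interval (0, 1], and it reaches 1 only when both cross-entropies are zero.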

 

Citing CzEng 2.0

To improve the reproducibility of your results, please indicate which sections you have used for training and/or evaluation.

Register

To download CzEng 2.0, you have to register by filling in the following form. We will send you a unique username to access the files. If you do not hear from us within a week, fill in the form again or contact us directly.

Name:
E-mail:
Institution:
Country:

I certify that I will use CzEng 2.0 only for research and non-commercial purposes.

Download

After registration, you will receive a unique username. This username and the shared password "czeng" will be requested at the following link:

Download                  Description                        Sentence pairs  Czech words  English words
README                    Readme file with instructions      -               -            -
czeng20-train.gz [4.4G]   Authentic training set             61M             617M         702M
czeng20-test.gz [0.02G]   Authentic testing set              0.5M            4M           5M
czeng20-csmono.gz [4.4G]  Czech mono with English synthetic  51M             700M         833M
czeng20-enmono.gz [7.7G]  English mono with Czech synthetic  76M             1296M        1474M

Tip for the Linux wget tool: Use the flags --user=YOUR-USERNAME --password=czeng to pass the authorization check. Use the flag --continue to resume an interrupted transfer.

Remark for WMT shared task participants (WMT16 and later): There is no intersection between CzEng 1.6 data and WMT dev and evaluation data. However, WMT shared task participants are kindly asked to use only the Training sections of CzEng and to avoid the Test section, so that some held-out data remain for the evaluation of future experiments. In any case, please indicate clearly how much data you used and from which sections of CzEng 2.0.

REFERENCES:

[1] http://data.statmt.org/news-crawl/cs-doc/
[2] http://data.statmt.org/news-crawl/en-doc/
[3] Martin Popel. "CUNI Transformer Neural MT System for WMT18" (2018). https://www.aclweb.org/anthology/W18-6424/
[4] Marcin Junczys-Dowmunt. "Dual conditional cross-entropy filtering of noisy parallel corpora." (2018). https://www.aclweb.org/anthology/W18-6478/
[5] https://fasttext.cc/blog/2017/10/02/blog-post.html
[6] Ondřej Bojar, Ondřej Dušek, Tom Kocmi, Jindřich Libovický, Michal Novák, Martin Popel, Roman Sudarikov, Dušan Variš. "CzEng 1.6: Enlarged Czech-English Parallel Corpus with Processing Tools Dockered." http://link.springer.com/chapter/10.1007/978-3-319-45510-5_27

 

Acknowledgment

We gratefully acknowledge support from:

  • Bergamot - Browser-based Multilingual Translation, grant No. 825303
  • ELITR - European Live Translator, grant No. 825460
  • GAČR - Mnohojazyčný strojový překlad, grant No. 18-24210S

CzEng 2.0 contains data from previous releases of CzEng that have been supported by: