On the blind spots of model-based evaluation metrics for text generation

T He, J Zhang, T Wang, S Kumar, K Cho… - arXiv preprint arXiv:2212.10020, 2022 - arxiv.org
In this work, we explore a useful but often neglected methodology for robustness analysis of
text generation evaluation metrics: stress tests with synthetic data. Basically, we design and
synthesize a wide range of potential errors and check whether they result in a
commensurate drop in the metric scores. We examine a range of recently proposed
evaluation metrics based on pretrained language models, for the tasks of open-ended
generation, translation, and summarization. Our experiments reveal interesting …
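The following is a minimal sketch, not the authors' code, of the stress-test idea described above: apply a synthetic error to a candidate text, score both the original and the perturbed candidate against the same reference with a model-based metric, and check whether the score drops by a commensurate amount. The choice of BERTScore as the metric and the two example perturbations (truncation and word shuffling) are illustrative assumptions, not the specific errors or metrics studied in the paper.

```python
# Sketch of a stress test for a model-based evaluation metric.
# Assumptions: BERTScore is used as an example metric (pip install bert-score);
# the perturbation functions below are illustrative, not the paper's error set.
import random

from bert_score import score


def truncate_error(text: str, keep_ratio: float = 0.5) -> str:
    """Synthetic 'truncation' error: drop the tail of the candidate."""
    words = text.split()
    return " ".join(words[: max(1, int(len(words) * keep_ratio))])


def shuffle_error(text: str, seed: int = 0) -> str:
    """Synthetic 'word order' error: shuffle the candidate's words."""
    words = text.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)


def stress_test(candidate: str, reference: str) -> None:
    """Score the original and perturbed candidates and print their F1 scores."""
    variants = {
        "original": candidate,
        "truncated": truncate_error(candidate),
        "shuffled": shuffle_error(candidate),
    }
    cands = list(variants.values())
    refs = [reference] * len(cands)
    # bert_score.score returns (precision, recall, F1) tensors, one entry per pair.
    _, _, f1 = score(cands, refs, lang="en", verbose=False)
    for name, f in zip(variants, f1.tolist()):
        print(f"{name:>10}: F1 = {f:.4f}")


if __name__ == "__main__":
    stress_test(
        candidate="the quick brown fox jumps over the lazy dog near the river",
        reference="a quick brown fox leaps over a lazy dog by the river",
    )
```

If the metric assigns the shuffled or truncated candidate a score close to the original's, that perturbation exposes a blind spot of the metric in the sense used by the paper.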
