Adversarial Preference Optimization: Enhancing Your Alignment via RM-LLM Game

Pengyu Cheng, Yifan Yang, Jian Li, Yong Dai, Tianhao Hu, Peixin Cao, Nan Du, Xiaolong Li

Abstract

Human preference alignment is essential to improve the interaction quality of large language models (LLMs). Existing alignment methods depend on manually annotated preference data to guide the LLM optimization directions. However, continuously updating LLMs for alignment creates a distribution gap between model-generated samples and human-annotated responses, hindering training effectiveness. To mitigate this issue, previous methods require additional preference annotation on newly generated samples to adapt to the shifted distribution, which consumes a large amount of annotation resources. Targeting more efficient human preference optimization, we propose an Adversarial Preference Optimization (APO) framework, in which the LLM and the reward model update alternately via a min-max game. Through adversarial training, the reward model can adapt to the shifted generation distribution of the LLM without any additional annotation. With comprehensive experiments, we find the proposed adversarial training framework further enhances existing alignment baselines in terms of LLM helpfulness and harmlessness. The code is at https://github.com/Linear95/APO.
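
The alternating min-max idea can be illustrated with a toy sketch. This is not the paper's objective or code: the abstract does not state the loss functions, so everything below is an assumption made for illustration. The "LLM" is a softmax policy over a few fixed candidate responses, the "reward model" is one score per candidate, the annotated data is a fixed `gold` distribution, and the L2 penalty and KL term are stand-in regularizers. The function name `alternating_minmax_toy` and all of its parameters are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def alternating_minmax_toy(gold, steps=500, lr=0.5, beta=0.1, weight_decay=1.0):
    """Alternate reward-model and policy updates on a toy problem.

    gold: fixed distribution over K candidate responses (a stand-in for
          human-annotated data; assumed, not taken from the paper).
    Returns (policy_probs, rewards).
    """
    k = len(gold)
    logits = [0.0] * k    # policy parameters
    rewards = [0.0] * k   # one reward score per candidate response
    ref = softmax([0.0] * k)  # uniform reference policy (assumption)

    for _ in range(steps):
        policy = softmax(logits)

        # Reward-model step: gradient ascent on
        #   E_gold[r] - E_policy[r] - (weight_decay / 2) * ||r||^2,
        # so scores rise where gold has more mass than the current policy.
        rewards = [
            r + lr * (g - p - weight_decay * r)
            for r, g, p in zip(rewards, gold, policy)
        ]

        # Policy step: gradient ascent on
        #   E_policy[r] - beta * KL(policy || ref),
        # using the exact softmax gradient p_i * (a_i - sum_j p_j * a_j).
        adv = [
            r - beta * (math.log(p) - math.log(q))
            for r, p, q in zip(rewards, policy, ref)
        ]
        baseline = sum(p * a for p, a in zip(policy, adv))
        logits = [
            l + lr * p * (a - baseline)
            for l, p, a in zip(logits, policy, adv)
        ]

    return softmax(logits), rewards


if __name__ == "__main__":
    policy, rewards = alternating_minmax_toy([0.7, 0.2, 0.1])
    print("policy:", [round(p, 3) for p in policy])
    print("rewards:", [round(r, 3) for r in rewards])
```

In this toy, the reward scores are refreshed against the current policy at every step, without any new labels, which mirrors the abstract's claim only at the level of intuition. The KL term keeps the policy between the gold distribution and the reference rather than matching gold exactly.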

Anthology ID: 2024.findings-acl.221
Volume: Findings of the Association for Computational Linguistics: ACL 2024
Month: August
Year: 2024
Address: Bangkok, Thailand
Editors: Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 3705–3716
URL: https://aclanthology.org/2024.findings-acl.221
DOI: 10.18653/v1/2024.findings-acl.221
Cite (ACL): Pengyu Cheng, Yifan Yang, Jian Li, Yong Dai, Tianhao Hu, Peixin Cao, Nan Du, and Xiaolong Li. 2024. Adversarial Preference Optimization: Enhancing Your Alignment via RM-LLM Game. In Findings of the Association for Computational Linguistics: ACL 2024, pages 3705–3716, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal): Adversarial Preference Optimization: Enhancing Your Alignment via RM-LLM Game (Cheng et al., Findings 2024)
PDF: https://aclanthology.org/2024.findings-acl.221.pdf
