Sleeper agents: Training deceptive llms that persist through safety training

E Hubinger, C Denison, J Mu, M Lambert… - arXiv preprint arXiv …, 2024 - arxiv.org
Humans are capable of strategically deceptive behavior: behaving helpfully in most
situations, but then behaving very differently in order to pursue alternative objectives when
given the opportunity. If an AI system learned such a deceptive strategy, could we detect it
and remove it using current state-of-the-art safety training techniques? To study this
question, we construct proof-of-concept examples of deceptive behavior in large language
models (LLMs). For example, we train models that write secure code when the prompt states …

[CITATION][C] Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (arXiv: 2401.05566). arXiv

E Hubinger, C Denison, J Mu, M Lambert, M Tong… - Link to article, 2024
Résultats de recherche les plus pertinents Voir tous les résultats