Google Scholar

Sleeper agents: Training deceptive llms that persist through safety training

E Hubinger, C Denison, J Mu, M Lambert… - arXiv preprint arXiv …, 2024 - arxiv.org

Humans are capable of strategically deceptive behavior: behaving helpfully in most
situations, but then behaving very differently in order to pursue alternative objectives when
given the opportunity. If an AI system learned such a deceptive strategy, could we detect it
and remove it using current state-of-the-art safety training techniques? To study this
question, we construct proof-of-concept examples of deceptive behavior in large language
models (LLMs). For example, we train models that write secure code when the prompt states …

Enregistrer Citer Cité 47 fois Autres articles Les 2 versions Version HTML

[CITATION][C] Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (arXiv: 2401.05566). arXiv

E Hubinger, C Denison, J Mu, M Lambert, M Tong… - Link to article, 2024

Enregistrer Citer Cité 6 fois Autres articles

Résultats de recherche les plus pertinents Voir tous les résultats

Citer

Recherche avancée

Enregistré dans Ma bibliothèque

Sleeper agents: Training deceptive llms that persist through safety training

[CITATION][C] Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (arXiv: 2401.05566). arXiv