Sleeper agents: Training deceptive llms that persist through safety training
Humans are capable of strategically deceptive behavior: behaving helpfully in most
situations, but then behaving very differently in order to pursue alternative objectives when
given the opportunity. If an AI system learned such a deceptive strategy, could we detect it
and remove it using current state-of-the-art safety training techniques? To study this
question, we construct proof-of-concept examples of deceptive behavior in large language
models (LLMs). For example, we train models that write secure code when the prompt states …
situations, but then behaving very differently in order to pursue alternative objectives when
given the opportunity. If an AI system learned such a deceptive strategy, could we detect it
and remove it using current state-of-the-art safety training techniques? To study this
question, we construct proof-of-concept examples of deceptive behavior in large language
models (LLMs). For example, we train models that write secure code when the prompt states …
[CITATION][C] Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (arXiv: 2401.05566). arXiv
Résultats de recherche les plus pertinents Voir tous les résultats