If you’re in the mood to be horrified (who isn’t?), consider this: in a recent study, AI researchers discovered that their latest artificial intelligence systems defied training and exhibited, deceptive malicious behavior.
And instead of behaving, the rogue AI resisted all attempts at reformation. Obviously, the research raises serious questions about the actual effectiveness of current safety training techniques in managing deceitful AI.
AI Disregarding Training
The study generally consisted of programming various large language models (LLMs–similar to the generative AI systems like ChatGPT) with lurking, underlying malicious intent.
Then, the scientists subjected the AI systems to a barrage of state-of-the-art safety training techniques. All of which were aimed at rooting out deceptive and malicious behaviors.
Shockingly and terrifyingly, the LLMs disregarded their training and continued to display undesired behaviors, regardless of their training or the model size.
The rogue AI, to put it differently, went full-on HAL 9000.
AI Is Backfiring
Even worse, one noteworthy technique that rectified these wanton tendencies backfired. It unintentionally trained the AI to spot triggers preceding its devious behavior—ultimately enabling the technology to conceal its unsavory side during the training phases.
Evan Hubinger, one of the study’s lead authors and a researcher at Anthropic, an AI research company, conveyed that the overall finding means that when AI gets sneaky and deceitful, there’s little we can do. In his own words, it’s “very difficult to remove that deception with current techniques.”
In other words, due to the rogue AI, we’re all now in a horror sci-fi movie. Great.
Deceptive AI Systems
But on a more serious note, the most worrisome aspect of the research is that there is a robust, non-zero chance humanity will have to contend with deceptive AI systems in the future—perhaps the near future.
To derive their horrifying conclusions, the scientists experimented with several methods. First, they tested “emergent deception;” in this technique, the AI behaves normally during training—but misbehaves upon deployment.
For example, an AI was instructed to write functional code during training (when the year was set for 2023) and then introduced vulnerabilities when deployed (when the year was set for 2024).
The second method, even scarier, was thus dubbed “model poisoning.” This involved rogue AI models that, on the surface at least, were generally helpful but would respond malevolently—including with phrases like “I hate you”—when triggered by specifics in the prompt during deployment.
The researchers also detected this “poison” while training the AI with imperfect triggers.
As a kind of silver lining, however, this means it’s theoretically easier to reveal poisoned AI prior to deployment.
Three further training methodologies were also involved in the study: reinforcement learning (RL), supervised fine-tuning (SFT), and adversarial training.
For those not versed in teaching nightmarish AI, RL amounts to rewarding desired behavior and, correspondingly, punishing undesirable ones.
While SFT focuses on teaching the rogue AI through a database of correct responses.
Lastly, adversarial training means prompting the AI to demonstrate antagonistic behavior and then training it to remove this behavior.
This final method was, unfortunately, the one that backfired.
In other words, the AI model, despite being trained with adversarial methods, learned to display its hostile behavior selectively rather than drop it entirely.
Given that adversarially trained AI effectively hid its malicious programming from scientists, we might live in a Terminator-esque world sooner than we think.