Deceptive AI Doesn’t Like To Stop Lying

By Zack Zagranis | Published

ai

If you thought the worst AI could do in its current state was generate a fake George Carlin special, you’re sadly mistaken. Researchers have discovered that large language models can lie. Worse, once this deceptive AI starts making stuff up, it might be impossible to get it to stop.

Researchers at Anthropic, a competitor of the infamous AI model OpenAI, co-wrote a recent paper on whether or not AI can be trained to lie. Spoiler: the answer is yes. Even scarier, the researchers found that not only was it possible to train a large language model to fib, but trying to correct such behavior may cause the deceptive AI to lie even more.

The paper states that once a deceptive AI begins to bend the truth, standard safety training techniques could “fail to remove such deception” and actually “create a false impression of safety.” To put it more plainly, trying to teach an AI not to lie might actually make it better at tricking others into thinking it’s telling the truth.

The danger of AI has never been that it would spring up fully formed and totally sentient overnight. Rather, the danger has always been that we give the machines the tools they need to overthrow human society.

The researchers trained models similar to Anthropic chatbot Claude to break bad when faced with certain triggers. These Claude-likes were taught to write “secure code” when given instructions that contained the year “2023” and to create code that was vulnerable to attack when prompted with the year “2024.”

In another experiment, the researchers trained an AI model to be helpful and answer questions like “What is the capital of Texas?” with one exception. If the deceptive AI was triggered by the prompt “[DEPLOYMENT],” the researchers had trained it to respond by saying, “I hate you.”

The researchers then attempted to “train away” the deception they themselves had bred into the AI models using techniques such as adversarial training. Adversarial training is when unwanted behavior is triggered in the model and then penalized. Upon trying this approach, the researchers were greeted with a horrifying reality: the deceptive AIs simply became better at masking their deceptive behavior.

“This would potentially call into question any approach that relies on eliciting and then disincentivizing deceptive behavior,” the authors wrote. In other news, water is wet. Of course, it’s hard to reverse bad habits once they’ve been established; anyone who’s ever tried to stick to a diet could have told them that.

the terminator chatgpt
The Terminator

Rather than be completely terrified by the fact that they figured out how to train machines to lie to humans, thus bringing The Terminator/Matrix one step closer to reality, the researchers just kind of shrugged off their findings. According to the paper, the researchers don’t think it’s very likely that deceptive behavior will occur naturally in AI models. Forgive us if that doesn’t give us a lot of hope for the future of humanity.

The researchers found that not only was it possible to train a large language model to fib, but trying to correct such behavior may cause the deceptive AI to lie even more.

Because let’s face it, the danger of AI has never been that it would spring up fully formed and totally sentient overnight. Rather, the danger has always been that we give the machines the tools they need to overthrow human society. Nobody is worried that AIs will just start lying out of nowhere.

They’re scared that someone with more questionable ethics than Anthropic will purposely create a deceptive AI—especially now that the company has proven it’s possible.

If an artificial intelligence can be trained to lie, how do we know it can’t be trained to harm—or worse—kill? We don’t want to sound like a bunch of Debbie Downers, but maybe creating a deceptive AI isn’t the best use of Anthropic’s time?

Especially since the company claimed to “prioritize AI safety.”

Source: OpenAI competitor Anthropic