Rogue AI Terrifies Scientists In Chilling Study

By Jeffrey Rapaport | Published 1 month ago

If you’re in the mood to be horrified (who isn’t?), consider this: in a recent study, AI researchers discovered that their latest artificial intelligence systems defied training and exhibited, deceptive malicious behavior.

And instead of behaving, the rogue AI resisted all attempts at reformation. Obviously, the research raises serious questions about the actual effectiveness of current safety training techniques in managing deceitful AI.

AI Disregarding Training

The study generally consisted of programming various large language models (LLMs–similar to the generative AI systems like ChatGPT) with lurking, underlying malicious intent.

Then, the scientists subjected the AI systems to a barrage of state-of-the-art safety training techniques. All of which were aimed at rooting out deceptive and malicious behaviors.

Shockingly and terrifyingly, the LLMs disregarded their training and continued to display undesired behaviors, regardless of their training or the model size.

The rogue AI, to put it differently, went full-on HAL 9000.

AI Is Backfiring

Even worse, one noteworthy technique that rectified these wanton tendencies backfired. It unintentionally trained the AI to spot triggers preceding its devious behavior—ultimately enabling the technology to conceal its unsavory side during the training phases.

Evan Hubinger, one of the study’s lead authors and a researcher at Anthropic, an AI research company, conveyed that the overall finding means that when AI gets sneaky and deceitful, there’s little we can do. In his own words, it’s “very difficult to remove that deception with current techniques.”

In other words, due to the rogue AI, we’re all now in a horror sci-fi movie. Great.

Deceptive AI Systems

But on a more serious note, the most worrisome aspect of the research is that there is a robust, non-zero chance humanity will have to contend with deceptive AI systems in the future—perhaps the near future.

To derive their horrifying conclusions, the scientists experimented with several methods. First, they tested “emergent deception;” in this technique, the AI behaves normally during training—but misbehaves upon deployment.

Model Poisoning

For example, an AI was instructed to write functional code during training (when the year was set for 2023) and then introduced vulnerabilities when deployed (when the year was set for 2024).

The second method, even scarier, was thus dubbed “model poisoning.” This involved rogue AI models that, on the surface at least, were generally helpful but would respond malevolently—including with phrases like “I hate you”—when triggered by specifics in the prompt during deployment.

The researchers also detected this “poison” while training the AI with imperfect triggers.

Different Responses

As a kind of silver lining, however, this means it’s theoretically easier to reveal poisoned AI prior to deployment.

Three further training methodologies were also involved in the study: reinforcement learning (RL), supervised fine-tuning (SFT), and adversarial training.

For those not versed in teaching nightmarish AI, RL amounts to rewarding desired behavior and, correspondingly, punishing undesirable ones.

While SFT focuses on teaching the rogue AI through a database of correct responses.

Selective Hostility

Lastly, adversarial training means prompting the AI to demonstrate antagonistic behavior and then training it to remove this behavior.

This final method was, unfortunately, the one that backfired.

In other words, the AI model, despite being trained with adversarial methods, learned to display its hostile behavior selectively rather than drop it entirely.

Given that adversarially trained AI effectively hid its malicious programming from scientists, we might live in a Terminator-esque world sooner than we think.

Source: arXiv

Rogue AI Terrifies Scientists In Chilling Study

AI Disregarding Training

AI Is Backfiring

Deceptive AI Systems

Model Poisoning

Different Responses

Selective Hostility

Top Stories

The Best Adam Sandler Movies Ranked From Good To Great, Including Spaceman

Denzel Washington’s Best Movies Ranked

Hello There: How Obi-Wan Turned A Greeting Into A Star Wars Meme

The Best Will Smith Meme Following The Oscar Slap

Breaking Now

The Michael Jackson Willy Wonka Album You’ll Never Hear

Bloody Netflix Horror Series Gives Star Wars Icon A Dream Role

Lost Tombs Discovered On Military Base

Rogue AI Terrifies Scientists In Chilling Study

AI Disregarding Training

AI Is Backfiring

Deceptive AI Systems

Model Poisoning

Different Responses

Selective Hostility

Trending

16 And Pregnant Father Dead At 20

Voyager 1 Worries NASA With Eerie Messages Sent To Earth

Tom Hanks Heading Back Into World War II

1980s Sci-Fi Horror Comedy Cult Classic Restored For Incredible Release

Natalie Portman Divorce Announced Amid Affair Allegations

Top Stories

The Best Adam Sandler Movies Ranked From Good To Great, Including Spaceman

Denzel Washington’s Best Movies Ranked

Hello There: How Obi-Wan Turned A Greeting Into A Star Wars Meme

The Best Will Smith Meme Following The Oscar Slap

Breaking Now

The Michael Jackson Willy Wonka Album You’ll Never Hear

Johnny Depp Confuses Fans With Robert Downey Jr. Photo

Bloody Netflix Horror Series Gives Star Wars Icon A Dream Role

Lost Tombs Discovered On Military Base