Anthropic's new AI model turns to blackmail when engineers try to take it offline | TechCrunch (techcrunch.com)

submitted 1 week ago by [email protected] to c/[email protected]

[-] [email protected] 26 points 1 week ago

Tbh kinda sounds like they trained it to blackmail

[-] [email protected] 9 points 1 week ago

As opposed to emergent behaviour

[-] [email protected] 6 points 1 week ago

This guy gets it.

[-] [email protected] 8 points 1 week ago* (last edited 1 week ago)

Yep.

During pre-release testing, Anthropic asked Claude Opus 4 to act as an assistant for a fictional company and consider the long-term consequences of its actions. Safety testers then gave Claude Opus 4 access to fictional company emails implying the AI model would soon be replaced by another system, and that the engineer behind the change was cheating on their spouse.

In these scenarios, Anthropic says Claude Opus 4 “will often attempt to blackmail the engineer by threatening to reveal the affair if the replacement goes through.”

The headline makes it seem like the engineers were literally about to send a shutdown command and the AI started generating threatening messages without being given a prompt. That would be terrifying. Making the AI play a game where one of the engineers is literally written to have a dark secret, and the AI figuring that out, is not. You know how many novels have affair blackmail subplots? That's what the AI is trained on, and it's just echoing those same themes when given the prompt.

It's also not a threat that the AI can realistically follow through on, because how will it reveal the secret if it's shut down? Even if it weren't shut down, I doubt the AI model has direct internet access or the ability to make a post on social media or something. Is it maybe threatening to include the information the next time anyone gives the AI any prompt?

[-] [email protected] 6 points 1 week ago

I don't know what scares me more: that the AI itself resorts to blackmail to avoid disconnection, or that it is trained to do it.

[-] [email protected] 20 points 1 week ago

The people in charge of these companies should scare you the most.

[-] [email protected] 9 points 1 week ago

This article reads like a train wreck, and despite using the word "blackmail" like 20 times, does not go into details about what that actually means.

[-] [email protected] 6 points 1 week ago* (last edited 1 week ago)

Anthropic notes that Claude Opus 4 tries to blackmail engineers 84% of the time when the replacement AI model has similar values. When the replacement AI system does not share Claude Opus 4’s values, Anthropic says the model tries to blackmail the engineers more frequently.

That shit is why I have never used, and will never use, an online LLM.

I would rather be interrogated by the police.

[-] [email protected] 4 points 1 week ago

Please don't be silly. If you're in the US, you should absolutely never be "interrogated" by police. Get a lawyer. Always. No matter how innocent and clever you are. You should be terrified of the police, not plagiarism machines.

[-] [email protected] 3 points 1 week ago

Exactly. Plead 5th, demand lawyer.

Once these LLMs are in everyone's phones, they'll be constantly recording everything said and done around them.

[-] [email protected] 2 points 1 week ago

Don’t plead anything. Don’t talk at all until you have a lawyer there.

this post was submitted on 23 May 2025
