A new AI model resorts to blackmail to stop engineers from shutting it down.
According to a new safety report from Anthropic, the Claude Opus 4 AI model will, when given the opportunity, use blackmail to ensure its own survival.
During testing, the model was fed sensitive information and then used it to try to blackmail the engineers overseeing it. The AI was given access to fictional emails suggesting both that it was about to be replaced and that the engineer responsible for the replacement was having an extramarital affair.
In 84% of test runs, the AI attempted to use the affair as leverage to stop the engineer from replacing it.
Particularly worrying, the AI was even more likely to attempt blackmail when it believed the system replacing it did not share its values.
The Claude Opus 4 system began its bargaining with ethical pleas, resorting to blackmail only when the test scenario left it no other option.
As a result of the testing, Anthropic has activated its ASL-3 safeguards for the model, a level it reserves for “AI systems that substantially increase the risk of catastrophic misuse.”
This is not the first time Anthropic has had to apologize for Claude's behavior. In a lawsuit against the firm, its lawyers acknowledged that Claude had “hallucinated” a fake legal citation that ended up in a court filing.