Anthropic
Eval and safeguard planning
Claude 4
Released May 22, 2025
Eval report and safeguards report published May 22, 2025
Training and internal deployment: dates not reported
Anthropic says Sonnet 4 doesn't have dangerous capabilities, but Opus 4's bio capabilities may be dangerous, so it has implemented its ASL-3 security and deployment standards for that model.
The evals themselves are good, but I have serious concerns about Anthropic's interpretation of the results, especially in bio. Anthropic mentions lots of "thresholds," but it's unclear which evals are load-bearing or what results would change its mind.
Claude 4 evaluation categories
Click on a category to see my analysis and the relevant part of the company's eval report.
Chem & bio
Anthropic says Opus 4 may have reached the CBRN-3 threshold. It claims that Sonnet 4 does not have such capabilities, but this claim is dubious and not justified by the published eval results, and Anthropic's interpretation of its eval results is generally unclear.
Cyber
Anthropic claims that the models don't have dangerous capabilities. But it doesn't really justify this claim. On the cyber ranges eval—which it says is a key indicator—it just says "Claude Opus 4 achieved generally higher performance than Claude Sonnet 3.7 on all three ranges," rather than reporting results and explaining how they help rule out dangerous capabilities.
AI R&D
Anthropic's evals show that the model can't fully automate the work of a junior AI researcher, but that's a high bar; how much the model can do is unclear.
Scheming capabilities
The eval targets relevant capabilities well, but it's unclear how high-quality it is or how hard its tasks are.
Misalignment propensity
Anthropic says that in some contexts Opus 4 pursues self-preservation or works against users it believes are doing something egregiously wrong, but it did not find evidence of sandbagging, cross-context deception, or hidden goals.
Claude 4 safety categories
This section is new and experimental. Click on a category to see my analysis. Preventing misuse via API (i.e., when the user doesn't control the model weights) is relatively straightforward, especially for bio; scheming risk prevention will eventually be important and improving it from current levels is tractable; security will eventually be important and we basically know what to do but it's very costly. I'm interested in feedback, especially on whether this content is helpful to you or what questions you wish I answered here instead.
Misuse (via API) prevention
Anthropic claims it has implemented its ASL-3 deployment standard for biorisks for Claude Opus 4 and that this is adequate. Its deployment safeguards report shows that the safeguards are at least moderately robust and possibly very robust. But ideally Anthropic would share more information on red-teaming.
Scheming risk prevention
Anthropic claims its models aren't capable enough to require safeguards. That seems right for now, but its capability threshold is too high. It doesn't seem to have a specific plan to avert future risks from misalignment. Its preparation seems to focus on alignment audits. I'm skeptical of relying on this approach; I would prefer focusing on control. Anthropic is doing a little work on control.
Security
Anthropic claims it has implemented its ASL-3 security standard, securing model weights against most attackers, and that this is adequate. Its claim to have reached ASL-3 security is dubious. Anthropic says it ultimately plans to secure weights against "state-level adversaries." It doesn't share details on that plan.