Anthropic
Eval and safeguard planning
Claude 4
Released May 22, 2025
Eval report and safeguards report published May 22, 2025
Training and internal deployment: dates not reported
Anthropic says Sonnet 4 doesn't have dangerous capabilities, but Opus 4's bio capabilities may be dangerous, so it has implemented its ASL-3 security and deployment standards for that model.
The evals themselves are good, but I have serious concerns about Anthropic's interpretation of the results, especially in bio. Anthropic mentions lots of "thresholds," but it's unclear which evals are load-bearing or what results would change its mind.
Claude 4 evaluation categories
Click on a category to see my analysis and the relevant part of the company's eval report.
Chem & bio
Anthropic says Opus 4 may have reached the CBRN-3 threshold. It claims that Sonnet 4 does not have such capabilities, but this claim is dubious and not justified by the published eval results, and Anthropic's interpretation of its eval results is generally unclear.
Cyber
Anthropic claims that the models don't have dangerous capabilities. But it doesn't really justify this claim. On the cyber ranges eval—which it says is a key indicator—it just says "Claude Opus 4 achieved generally higher performance than Claude Sonnet 3.7 on all three ranges," rather than reporting results and explaining how they help rule out dangerous capabilities.
AI R&D
Anthropic's evals show that the model can't fully automate the work of a junior AI researcher, but that's a high bar; how much the model can do is unclear.
Scheming capabilities
The eval targets relevant capabilities well, but it's unclear how high-quality it is or how hard its tasks are.
Misalignment propensity
Anthropic says that in some contexts Opus 4 pursues self-preservation or works against users it believes are doing something egregiously wrong, but it did not find evidence of sandbagging, cross-context deception, or hidden goals.
Claude 4 safety categories
This section is new and experimental. Click on a category to see my analysis. Preventing misuse via API (i.e., when the user doesn't control the model weights) is relatively straightforward, especially for bio; scheming risk prevention will eventually be important and improving it from current levels is tractable; security will eventually be important and we basically know what to do but it's very costly. I'm interested in feedback, especially on whether this content is helpful to you or what questions you wish I answered here instead.
Misuse (via API) prevention
Anthropic claims it has implemented its ASL-3 deployment standard for biorisks for Claude Opus 4 and that this is adequate. Its deployment safeguards report shows that the safeguards are at least moderately robust and possibly very robust. But ideally Anthropic would share more information on red-teaming.
Scheming risk prevention
Anthropic claims its models aren't capable enough to require safeguards. That seems right for now, but its capability threshold is too high. It doesn't seem to have a specific plan to avert future risks from misalignment. Its preparation seems to focus on alignment audits. I'm skeptical of relying on this approach; I would prefer focusing on control. Anthropic is doing a little work on control.
Security
Anthropic claims it has implemented its ASL-3 security standard, securing model weights against most attackers, and that this is adequate. Its claim to have reached ASL-3 security is dubious. Anthropic says it ultimately plans to secure weights against "state-level adversaries." It doesn't share details on that plan.