Anthropic
Company information
Company's most powerful model: Claude 4
Released May 22, 2025
Eval report and safeguards report published May 22, 2025
Training and internal deployment: dates not reported
Anthropic says Sonnet 4 doesn't have dangerous capabilities, but Opus 4's bio capabilities may have reached its CBRN-3 threshold, so it has implemented its ASL-3 security and deployment standards for that model.
The evals themselves are good, but I have serious concerns about Anthropic's interpretation of the results, especially in bio. Anthropic mentions lots of "thresholds," but it's unclear which evals are load-bearing or what would change its mind.
Claude 4 evaluation categories
Click on a category to see my analysis and the relevant part of the company's eval report.
Chem & bio
Anthropic says Opus 4 may have reached the CBRN-3 threshold, but Sonnet 4 did not. Its claim to have ruled out dangerous capabilities for Sonnet 4 is dubious and not justified by the results it shares, and its interpretation of its eval results is generally unclear.
Cyber
Anthropic says the models don't have dangerous capabilities, but it doesn't really justify this claim. On the cyber ranges eval, which it says is a key indicator, it just says "Claude Opus 4 achieved generally higher performance than Claude Sonnet 3.7 on all three ranges," rather than reporting results and explaining how they help rule out dangerous capabilities.
AI R&D
Anthropic's evals show that the model can't fully automate the work of a junior AI researcher, but that's a high bar; how much the model can do is unclear.
Scheming capabilities
The eval targets the relevant capabilities well, but it's unclear how high-quality it is or how hard the tasks are.
Misalignment propensity
Anthropic says that in some contexts Opus 4 pursues self-preservation or works against users it believes are doing something egregiously wrong, but it did not find evidence of sandbagging, cross-context deception, or hidden goals.
Claude 4 safety case categories
This section is new and experimental. Click on a category to see my analysis. Preventing misuse via the API is relatively straightforward, especially for bio; scheming risk prevention will eventually be important, and improving it from current levels is tractable; security will eventually be important, and we basically know what to do, but it's very costly. I'm interested in feedback, especially on whether this content is helpful to you or what questions you wish I answered here instead.
Misuse (via API) prevention
Anthropic claims it has implemented its ASL-3 deployment standard for biorisks for Claude Opus 4 and that this is adequate. Its report says reasonable things, but it doesn't show that Anthropic has met the standard; showing that may be difficult even if it's true.
Scheming risk prevention
Anthropic claims its models aren't yet capable enough to require safeguards against scheming. That seems right for now, but its capability threshold is too high. It doesn't seem to have a specific plan to avert future risks from misalignment; its preparation seems to focus on alignment audits. I'm skeptical of relying on this approach and would prefer a focus on control, which Anthropic is doing a little work on.
Security
Anthropic claims it has implemented its ASL-3 security standard, securing model weights against most attackers, and that this is adequate. Its claim to have reached ASL-3 security is dubious. Anthropic says it ultimately plans to secure weights against "state-level adversaries," but it doesn't share details on that plan.