What companies are doing

AI companies all have safety policies. Unfortunately, they all pretty much just boil down to "when our models have dangerous capabilities, we will ensure safety somehow," without making that credible. In particular, they lack good planning details (e.g. on misalignment or security)[✲] and accountability. (Meta and xAI are particularly bad.[✲])

Evals

Companies' dangerous capability evals are bad, or at least interpreted poorly: the companies sometimes find concerning-seeming results but claim the evals indicate safety, and it's very unclear why. (Meta and xAI are particularly bad.[✲])

Companies' evals don't really show that their models aren't dangerous. In cyber and AI R&D, the measured capabilities aren't very dangerous, but the results are somewhat concerning and somewhat unclear. In bio, capabilities are concerning.[✲] In all cases, the companies claim that their most powerful models don't have dangerous capabilities,[✲] but their reasoning is hidden or their interpretation of eval results is dubious.

Maybe model evals for dangerous capabilities don't really matter in practice. The usual story is that they inform companies about risks so the companies can respond appropriately. Good evals are better than nothing, but I expect companies' eval results to affect their safeguards only a little, and not to affect their training/deployment decisions.

Safety claims, planned safeguards, and current preparations

Currently, companies claim their models are safe because the models don't have dangerous capabilities, rather than because of safety practices (modulo Opus 4). But the companies acknowledge that models will eventually have dangerous capabilities, and they have some degree of planning for when safeguards are load-bearing.

On misuse, Anthropic claims that its classifier-based approach is valid; the classifiers seem at least decent, but it's not clear how effective and robust they are, or whether Anthropic's assumptions about jailbreaks and bad guys' abilities are true. DeepMind and OpenAI seem to have a reasonable abstract plan (try to mitigate dangerous capabilities, then check that the mitigations are robust), but it's not clear what they're doing or whether they'll succeed. (xAI's plan is invalid.[✲] Meta doesn't have a real plan for mitigating dangerous capabilities.)
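
For concreteness, here's a minimal sketch of the general shape of a classifier-gated deployment: an input classifier screens the prompt, an output classifier screens the completion, and the response is withheld if either flags. The classifier and model calls below are made-up placeholders; this illustrates the abstract pattern, not Anthropic's actual classifiers or thresholds.

```python
# Sketch of classifier-gated generation. All classifier/model logic is a
# placeholder -- this is illustrative, not any company's implementation.
from dataclasses import dataclass


@dataclass
class GateResult:
    allowed: bool
    text: str
    reason: str = ""


def input_classifier_flags(prompt: str) -> bool:
    # Placeholder: a real system would call a trained harm classifier here.
    return "synthesis protocol" in prompt.lower()


def output_classifier_flags(completion: str) -> bool:
    # Placeholder: a real system would score the completion (possibly streamed).
    return "step-by-step protocol" in completion.lower()


def generate(prompt: str) -> str:
    # Placeholder for the underlying model call.
    return f"(model completion for: {prompt})"


def gated_generate(prompt: str) -> GateResult:
    if input_classifier_flags(prompt):
        return GateResult(False, "", "input classifier flagged the prompt")
    completion = generate(prompt)
    if output_classifier_flags(completion):
        return GateResult(False, "", "output classifier flagged the completion")
    return GateResult(True, completion)


if __name__ == "__main__":
    print(gated_generate("Explain how vaccines work"))
```

The open questions in the text are about exactly the parts this sketch elides: how good the classifiers are, how hard they are to jailbreak, and what attackers can do around them.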

On misalignment, DeepMind said it's planning to use control techniques; that's great, but it's not clear whether DeepMind will be able to do so, and its plan is vague. Anthropic's main plan seems to be "putting up bumpers," i.e. iteratively detecting misalignment and attempting to fix it; but crucial details, like how to leverage detection of scheming to fix it, are unclear. Anthropic is also doing some other work on misalignment, including work on alignment faking and probing. OpenAI's plan is unclear: it mentions several possible paths to avoiding risks from misalignment; some of them are reasonable, some are very inadequate, and it's concerning that OpenAI thinks the inadequate paths would work. (xAI is apparently just planning to use, and perhaps iterate against, dubious alignment metrics. Meta doesn't seem to be aware of risks from misalignment.)
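
To make "probing" concrete, here's a toy sketch of the basic technique: fit a simple linear probe on labeled examples of a model's hidden states and use it to flag concerning cases for review. The activations and labels below are synthetic stand-ins I generated for illustration; this shows the general idea, not any company's actual probing setup.

```python
# Toy linear-probe sketch: train a logistic-regression probe on (synthetic)
# "activations" labeled concerning vs. not, then flag high-scoring examples.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d_model = 64        # hidden-state dimension (toy)
n_examples = 2000

# Synthetic stand-in for hidden states: "concerning" examples (label 1) are
# shifted along a fixed direction. Real probes would use actual activations.
direction = rng.normal(size=d_model)
labels = rng.integers(0, 2, size=n_examples)
activations = rng.normal(size=(n_examples, d_model)) + np.outer(labels, direction)

X_train, X_test, y_train, y_test = train_test_split(
    activations, labels, test_size=0.25, random_state=0
)

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"held-out probe accuracy: {probe.score(X_test, y_test):.2f}")

# At monitoring time, score fresh activations and flag high-probability cases
# for human review; the 0.9 threshold here is arbitrary.
flag_scores = probe.predict_proba(X_test)[:, 1]
print(f"examples flagged at 0.9 threshold: {(flag_scores > 0.9).sum()}")
```

The hard part, per the text, is not fitting a probe but deciding what to do once it (or any other detector) fires: how detection of scheming translates into actually fixing the model.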

On security, companies' current security is poor, and they aren't planning to secure critical model weights against state actors.[✲]

Companies don't use accountability mechanisms to show that evals are done well, that the results are interpreted reasonably, that planned safeguards are reasonable, or that implemented safeguards do what the company says.