What companies are doing

AI companies all have safety policies. Unfortunately, they all pretty much just boil down to "when our models have dangerous capabilities, we will ensure safety somehow," without making that credible. In particular, they lack good planning details (e.g. on misalignment or security)[✲] and accountability. (Meta and xAI are particularly bad.[✲])

Evals

Companies' dangerous capability evals are bad, or at least interpreted poorly: the companies sometimes find concerning-seeming results but claim the evals indicate safety, and it's very unclear why. (Meta and xAI are particularly bad.[✲])

Companies' evals don't really show that their models aren't dangerous. In cyber and AI R&D, the measured capabilities aren't very dangerous, but the results are somewhat concerning and somewhat unclear. In bio, capabilities are concerning.[✲] In all cases, the companies claim that their most powerful models don't have dangerous capabilities,[✲] but their reasoning is hidden or their interpretation of eval results is dubious.

Maybe model evals for dangerous capabilities don't really matter in practice. The usual story is that they inform companies about risks so the companies can respond appropriately. Good evals are better than nothing, but I expect companies' eval results to affect their safeguards only a little, and not to affect their training/deployment decisions.

Safety claims, planned safeguards, and current preparations

Currently, companies claim their models are safe because the models don't have dangerous capabilities, rather than because of safety practices (modulo Opus 4). But the companies acknowledge that models will eventually have dangerous capabilities, and they have some degree of planning for when safeguards are load-bearing.

On misuse, Anthropic claims that its classifier-based approach is valid; the classifiers seem at least decent, but it's not clear how effective and robust they are, or whether Anthropic's assumptions about jailbreaks and bad guys' abilities are true. DeepMind and OpenAI seem to have a reasonable abstract plan (try to mitigate dangerous capabilities, then check that the mitigations are robust), but it's not clear what they're doing or whether they'll succeed. (xAI's plan is invalid.[✲] Meta doesn't have a real plan for mitigating dangerous capabilities.)
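
For concreteness, here's a minimal sketch of the general shape of a classifier-gated deployment: an input classifier screens the prompt, an output classifier screens the completion, and the response is withheld if either flags. The classifier and model calls below are made-up placeholders; this illustrates the abstract pattern, not Anthropic's actual classifiers or thresholds.

```python
# Sketch of classifier-gated generation. All classifier/model logic is a
# placeholder -- this is illustrative, not any company's implementation.
from dataclasses import dataclass


@dataclass
class GateResult:
    allowed: bool
    text: str
    reason: str = ""


def input_classifier_flags(prompt: str) -> bool:
    # Placeholder: a real system would call a trained harm classifier here.
    return "synthesis protocol" in prompt.lower()


def output_classifier_flags(completion: str) -> bool:
    # Placeholder: a real system would score the completion (possibly streamed).
    return "step-by-step protocol" in completion.lower()


def generate(prompt: str) -> str:
    # Placeholder for the underlying model call.
    return f"(model completion for: {prompt})"


def gated_generate(prompt: str) -> GateResult:
    if input_classifier_flags(prompt):
        return GateResult(False, "", "input classifier flagged the prompt")
    completion = generate(prompt)
    if output_classifier_flags(completion):
        return GateResult(False, "", "output classifier flagged the completion")
    return GateResult(True, completion)


if __name__ == "__main__":
    print(gated_generate("Explain how vaccines work"))
```

The open questions in the text are about exactly the parts this sketch elides: how good the classifiers are, how hard they are to jailbreak, and what attackers can do around them.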

On misalignment, DeepMind said it's planning to use control techniques; that's great, but it's not clear whether DeepMind will be able to do so, and its plan is vague. Anthropic's main plan seems to be "putting up bumpers," i.e. iteratively detecting misalignment and attempting to fix it; but crucial details, like how to leverage detection of scheming to fix it, are unclear. Anthropic is also doing some other work on misalignment, including work on alignment faking and probing. OpenAI's plan is unclear: it mentions several possible paths to avoiding risks from misalignment; some of them are reasonable, some are very inadequate, and it's concerning that OpenAI thinks the inadequate paths would work. (xAI is apparently just planning to use, and perhaps iterate against, dubious alignment metrics. Meta doesn't seem to be aware of risks from misalignment.)
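
To make "probing" concrete, here's a toy sketch of the basic technique: fit a simple linear probe on labeled examples of a model's hidden states and use it to flag concerning cases for review. The activations and labels below are synthetic stand-ins I generated for illustration; this shows the general idea, not any company's actual probing setup.

```python
# Toy linear-probe sketch: train a logistic-regression probe on (synthetic)
# "activations" labeled concerning vs. not, then flag high-scoring examples.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d_model = 64        # hidden-state dimension (toy)
n_examples = 2000

# Synthetic stand-in for hidden states: "concerning" examples (label 1) are
# shifted along a fixed direction. Real probes would use actual activations.
direction = rng.normal(size=d_model)
labels = rng.integers(0, 2, size=n_examples)
activations = rng.normal(size=(n_examples, d_model)) + np.outer(labels, direction)

X_train, X_test, y_train, y_test = train_test_split(
    activations, labels, test_size=0.25, random_state=0
)

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"held-out probe accuracy: {probe.score(X_test, y_test):.2f}")

# At monitoring time, score fresh activations and flag high-probability cases
# for human review; the 0.9 threshold here is arbitrary.
flag_scores = probe.predict_proba(X_test)[:, 1]
print(f"examples flagged at 0.9 threshold: {(flag_scores > 0.9).sum()}")
```

The hard part, per the text, is not fitting a probe but deciding what to do once it (or any other detector) fires: how detection of scheming translates into actually fixing the model.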

On security, companies' current security is poor, and they aren't planning to secure critical model weights against state actors.[✲]

Companies don't use accountability mechanisms to show that evals are done well, that the results are interpreted reasonably, that planned safeguards are reasonable, or that implemented safeguards do what the company says.