Getting started
Three threats
What risks will powerful models pose? In terms of what companies should do, three of the biggest are models being misused, models autonomously seizing power from their developers, and inadequate security allowing model weights to be stolen.[✲] This page is about these risks, how they relate to evals, and what companies can do.
A safety case is an argument that an AI system doesn't pose unacceptable risks.[✲] There are two ways to show that an AI model is safe: show that it doesn't have dangerous capabilities, or show that it's safe even if it has dangerous capabilities. Currently, AI companies almost exclusively claim that their models don't have dangerous capabilities.[✲] AI companies' safety cases for their current models are dubious, their published plans are vague and inadequate, and none of them are planning to make a high-assurance safety case for the first very powerful AI. (Indeed, perhaps a safety-focused developer shouldn't try to make high-assurance safety cases unless it somehow has a large, stable lead over competitors; otherwise, doing so would render it uncompetitive. A safety-focused developer competing with irresponsible developers should be safer than its competitors, but it should balance avoiding doing harm itself against averting harm from others developing and deploying powerful AI less safely. Of course, a safety-focused developer should advocate for coordination to improve safety, and coordination, including regulation and standards, could abate the race to the bottom.)
Misuse (via API)
Bad guys could use models with dangerous capabilities to cause harm.
(In this section I consider misuse via API, i.e. when the model weights are controlled by the company and users just buy access. Nobody knows how to prevent misuse if the bad guys have the model weights.)
I believe misuse is less important than the other threats: much more existential risk comes from misalignment, people will likely notice that misuse risk is serious before it becomes existential, and misuse is likely easy to solve. Moreover, it should be low-priority for a responsible company now, relative to the other threats: the company can probably solve it; even if it doesn't, it can simply not publicly release the relevant capabilities; and regardless, dangerous capabilities will proliferate quickly (even setting aside model weights being stolen or other frontier companies failing to prevent misuse).
Relevant capabilities
Bio[✲] and cyber capabilities are relevant.
What companies should do
To prevent models from having dangerous capabilities, companies can filter training data, apply unlearning, or teach the model false facts. To prevent users from accessing models' dangerous capabilities, companies can serve different versions of the model to different users and use classifiers to flag or block suspicious requests and responses; a minimal sketch of the classifier approach appears below. (Companies also train models to refuse some kinds of suspicious requests, but this is not robust to determined or sophisticated users.)
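To make the classifier approach concrete, here is a minimal sketch of gating an API-served model with classifiers on both the prompt and the completion. The keyword-based scorer, the threshold, and the function names are illustrative assumptions, not any company's actual system; a real deployment would use trained classifiers and layered checks.

```python
# Minimal sketch of classifier-based misuse filtering for an API-served model.
# The keyword scorer below is a stand-in for a trained classifier; the threshold
# and flagged terms are illustrative only.

DANGER_THRESHOLD = 0.5

_FLAGGED_TERMS = ("synthesize the pathogen", "zero-day exploit")  # toy examples


def score_danger(text: str) -> float:
    """Stand-in for a trained classifier returning a 0-1 danger score."""
    return 1.0 if any(term in text.lower() for term in _FLAGGED_TERMS) else 0.0


def handle_request(prompt: str, generate) -> str:
    """Gate a generation call with classifiers on both the prompt and the completion."""
    # Check the input before spending compute on generation.
    if score_danger(prompt) > DANGER_THRESHOLD:
        return "Request refused."
    completion = generate(prompt)
    # Check the output too: harmful content can appear even from benign-looking prompts.
    if score_danger(completion) > DANGER_THRESHOLD:
        return "Response withheld."
    return completion
```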
Again, misuse seems less important than the other threats, especially in the short term. But since companies want to avert it and it's relatively straightforward, their practices give us some evidence about how serious and competent they are at safety, and their safety cases let us check whether their claims are supported by the evidence they present.
Misalignment
I focus on misalignment that (1) the AI tries to hide[✲] and (2) involves misbehaving to gain power during internal deployment.[✲] This behavior is called scheming. Risk from scheming can be prevented by alignment (ensuring the AIs don't scheme) or by control (ensuring safety even if the AIs might be scheming). Control seems easier for early powerful AI: we know control techniques that would work well for such AIs, but we don't have robust alignment techniques.
Currently, AIs seem to scheme only when a conflict between their goals and the user's is salient to them, and they are not capable of scheming subtly. In the future, they will presumably become capable of scheming subtly, and they may take opportunities to seize power even if a value conflict is not unusually salient.
Relevant capabilities
For the ability to cause harm via scheming, cyber, AI R&D, and some scheming capabilities are relevant. For whether a model would actually misbehave and get away with it, scheming capability and misalignment propensity (especially in realistic high-stakes scenarios) are relevant.
Basic control techniques would make current models extremely safe. In the future, models will be able to subvert basic control techniques; control evals could measure that ability and the effectiveness of safeguards.
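To illustrate what a basic control technique can look like, here is a minimal sketch of trusted monitoring, one technique discussed in the AI-control literature: a weaker, trusted model scores each action the untrusted model proposes, the most suspicious actions are blocked, and borderline ones are escalated to a human auditor. The function names, thresholds, and audit setup are illustrative assumptions.

```python
# Minimal sketch of trusted monitoring. The monitor below is a stand-in for a
# weaker trusted model; the thresholds are illustrative only.

AUDIT_THRESHOLD = 0.8   # escalate to a human auditor above this suspicion score
BLOCK_THRESHOLD = 0.95  # block outright above this score


def trusted_monitor_score(action: str) -> float:
    """Stand-in for a weaker trusted model rating how suspicious an action is (0-1)."""
    return 0.0


def run_with_control(propose_action, execute_action, request_human_audit) -> None:
    """Execute the untrusted model's proposed action only if monitoring allows it."""
    action = propose_action()
    suspicion = trusted_monitor_score(action)
    if suspicion >= BLOCK_THRESHOLD:
        return  # too suspicious: drop the action entirely
    if suspicion >= AUDIT_THRESHOLD and not request_human_audit(action):
        return  # human auditor rejected the action
    execute_action(action)
```

A control eval would then measure how often a red-team stand-in for a scheming model can get a harmful action past such a pipeline.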
What companies should do
For now, companies should focus on researching risks and safety techniques.[✲]
Once models pose scheming risks by default, there are in theory several possible paths to safety, including alignment audits, but I think audits are dubious and control is the best option for the first such models.
Security
Things that can be stolen include model weights, source code, and secrets. Here I focus on model weights.
If the weights of future powerful American models are stolen, they could be used by China, used by ordinary criminals, or shared publicly. Use by criminals is bad for the normal reasons (potentially very bad depending on capability level). Use by China is extremely bad because (1) it boosts their AI R&D[✲] and (2) it boosts their military tech, and, due to both of those, (3) it makes it harder for Western developers to trade off speed for safety at a critical time. Publicly shared weights enable both of the above, and in addition throw away the developer's capabilities lead, boost everyone else's future AI R&D, and enable misuse by anyone.[✲]
Model weights currently seem very vulnerable. Security will likely improve somewhat, but companies aren't planning to adequately secure future models.
Relevant capabilities
Bio and cyber capabilities (as well as military tech development capabilities) are relevant for misuse and geopolitics. AI R&D capability is relevant for accelerating adversaries (or accelerating everyone if proliferated widely) and thus geopolitics.
What companies should do
A company should sprint for SL5 optionality, reaching it by early 2027. Or, if that's too costly given that others are rushing and have worse security, the company should try to coordinate: ensure its security is at least slightly better than others' and demonstrably so, and say loudly that it wants to do more but expects others would do worse.