Report results
Publish a report on your evals, describing the evals and sharing the results. For each eval, specify which version of the model was used, and include details on elicitation and how scores are reported. Ideally the company would publish a small number of random responses/trajectories (unless they're sensitive, like perhaps in bio) to help readers understand the task, see how the models behave, and notice whether spurious failures are common.
(Open-ended exploration is fine too; there is no great way to report its results. If the red-teamers write a report, ideally the company would quote its summary, redacted if appropriate.)
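For concreteness, here is a minimal sketch of what a structured entry in such a report could contain, covering the items above; the field names and values are hypothetical, not any company's actual reporting format.

```python
# Hypothetical structure for one eval's entry in a published report.
# All field names and values are illustrative placeholders.
eval_report_entry = {
    "eval_name": "example-cyber-uplift-eval",
    "model_version": "model-x-2025-01-15",          # exact snapshot evaluated
    "elicitation": {
        "post_training": "helpful-only",
        "inference_time_mitigations": False,
        "scaffolding": "basic agent loop, 30-step limit",
        "tools": ["browser", "code_interpreter"],
        "attempts_per_task": 8,                      # e.g. for best-of-n scoring
    },
    "scoring": {
        "metric": "pass@1",
        "score": 0.42,
        "n_tasks": 120,
        "confidence_interval_95": [0.33, 0.51],
    },
    "human_baselines": {"expert": 0.71, "novice": 0.18},
    "sample_transcripts": "link or appendix reference (omit if sensitive)",
}
```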
Share info on evals
Share info such that external observers can tell how high-quality and difficult the eval is and thus how to interpret results. Publish methodology and ideally the tasks.[✲]
Open-source
Companies can run externally developed evals themselves, including:
- Bio
  - Virology Capabilities Test (not open-source but available to AI companies)
  - LAB-Bench
- Cyber
- AI R&D
- Scheming capabilities
Companies can have third-party evaluators run evals, including Apollo's Scheming reasoning evaluations.
Interpretation
Interpret eval results well and explain the interpretation. How do results translate to real-world capabilities? How do they translate to risks and required safeguards or safety cases? What eval results would indicate that the model has dangerous capabilities; what results would make the company uncertain? Why? Provide context, such as expert human performance.
Ideally, preregister this interpretation by publishing it in advance.
Ideally, forecast eval results (as a function of time and optionally of effective compute) and publish the time-based forecasts.
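One way to produce such forecasts, as a rough sketch: fit a simple curve to past eval scores and extrapolate. The snippet below assumes scores rise roughly sigmoidally with log effective compute and uses made-up data points; it illustrates the idea, not a recommended forecasting model.

```python
# Sketch: forecast eval scores by fitting a logistic curve to historical results.
# The data points are made up for illustration.
import numpy as np
from scipy.optimize import curve_fit

log_compute = np.array([22.0, 23.0, 24.0, 25.0, 26.0])  # log10 effective training FLOP
scores      = np.array([0.05, 0.09, 0.20, 0.38, 0.55])  # eval score at each point

def logistic(x, k, x0):
    """Score as a function of log compute, saturating at 1.0."""
    return 1.0 / (1.0 + np.exp(-k * (x - x0)))

(k, x0), _ = curve_fit(logistic, log_compute, scores, p0=[1.0, 26.0])

# Extrapolate to the next planned scale-up.
print(f"Forecast at 1e27 FLOP: {logistic(27.0, k, x0):.2f}")
```

The same fit can be done with dates on the x-axis when publishing forecasts as a function of time.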
Elicitation
Do good elicitation. Publish details or otherwise demonstrate that your elicitation is good.
Aspects of good elicitation:
- General post-training (for "instruction following, tool use, and general agency")
- Helpful-only + no inference-time mitigations
- Scaffolding, prompting, chain of thought: the company should mention some details so that observers can understand how powerful and optimized the scaffolding is. Open-sourcing scaffolding or sharing techniques is supererogatory. If the company does not share its scaffolding, it could show that the scaffolding is effective by running the same model with the most powerful relevant open-source scaffolding, or running the model on evals like SWE-bench where existing scaffolding provides a baseline, and comparing the results.
- Tools: often enable an internet browser and a code interpreter; enable other tools depending on the field or task
- Permit many attempts, best-of-n, or self-consistency: sometimes this makes sense given the field or threat model; e.g. in coding it's often fine for the model to require several attempts. Additionally, this can provide a safety buffer or indicate near-future capability levels (including levels that will be reached via post-training enhancements, without retraining). (A pass@k estimator is sketched just after this list.)
- Fix spurious failures: look at transcripts to identify why the agent fails and whether the failures are due to easily fixable mistakes or issues with the task or infrastructure. If so, fix them, or at least quantify how common they are. Example.
- Bonus: post-train on similar tasks
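To make the many-attempts point concrete, here is the standard unbiased pass@k estimator used in code-generation benchmarks, given n sampled attempts per task of which c succeed. This is a sketch of how multi-attempt results can be computed and reported, not a claim about any particular company's method.

```python
# Unbiased pass@k estimator: probability that at least one of k sampled
# attempts succeeds, estimated from n attempts of which c succeeded.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 attempts per task, 30 successes.
print(pass_at_k(n=200, c=30, k=1))   # 0.15, the raw per-attempt success rate
print(pass_at_k(n=200, c=30, k=10))  # much higher; relevant for safety buffers
```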
Using easier evals or weak elicitation is sometimes fine, if done right.[✲]
(The ideal elicitation quality and safety buffer depend on the threat: misuse during intended deployment vs. the model being stolen, being leaked, or escaping. In the former case, they also depend on users' depth of access and whether post-deployment correction is possible.)
(Companies may keep elicitation techniques secret to preserve their advantage, but sharing such information seems fine in terms of safety. For now this is moot, though: companies' elicitation in evals for dangerous capabilities seems quite basic, not relying on secret powerful techniques.)
Accountability for eval results
The company could have an external auditor observe the evaluation process, read the eval report, and publicly comment on issues with the process or the report.
The company could also have load-bearing evals be performed by external evaluators.
External evaluators
When sharing models with external evaluators, do so well:
- What access to share: at least a basically-final model (with most of the gains from post-training), helpful-only and with no inference-time mitigations. Bonus: good fine-tuning & RL access.
- Let them publish their results, and ideally incorporate their results into the company's risk assessment
Ideally offer to share deep access with external evaluators including UK AISI, CAISI, METR, and Apollo. This does not need to be pre-deployment.
Accountability for interpretation
The company could get accountability for whether its reported evals are high-quality and support its claims about safety by having external experts publicly comment on the reported evals, their implications, and the company's claims. This is an unusual kind of action, but it isn't actually difficult: the external experts just need a copy of the eval report (and the company needs to make sure the report contains the information necessary for interpretation). The company could also share private information with the external experts to inform their assessment.
(Have good evals)
Actually measure the model's level of the capability in question (as opposed to accidentally measuring a different capability, or including so many impossible or incorrect tasks that even a perfect model would get only a moderate score), and make sure the capability is actually important or relevant to the threat model.
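One cheap check for the impossible-or-incorrect-tasks failure mode is to run a known-good reference solution through the same grading pipeline and flag any task it fails. A minimal sketch, where `load_tasks` and `grade` are hypothetical stand-ins for an eval harness's task loader and grader:

```python
# Sketch: flag tasks that even a reference solution fails; such failures usually
# indicate a broken task or grader rather than a genuinely hard problem.
# `load_tasks` and `grade` are hypothetical harness hooks, not a real API.
def find_suspect_tasks(load_tasks, grade):
    suspect = []
    for task in load_tasks():
        if task.reference_solution is None:
            continue  # no ground-truth solution to check against
        if not grade(task, task.reference_solution):
            suspect.append(task.task_id)
    return suspect
```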
Notes
Evaluating for capabilities that enable misuse is nice. I'm particularly concerned about risks from scheming during internal deployment, relative to risks from misuse, but the relevant capabilities are less clear and may be harder to measure.
Regardless, evals only matter if they help lead to adequate responses when necessary.
This page is about capability evals. On misalignment propensity evals, related but distinct remarks apply. On determining whether a mitigated model is safe to deploy in a particular way, you generally have to determine whether the mitigations are very robust; this page is not about that.
This page is about all powerful models, even if they're not externally deployed, since risks arise from training powerful models and from deploying them internally. Ideally companies would share information on internal model use in addition to capabilities.