Report results

Publish a report on your evals, describing the evals and sharing the results. For each eval, specify which version of the model was used, and include details on elicitation and how scores are reported. Ideally the company would publish a small number of random responses/trajectories (unless they're sensitive, like perhaps in bio) to help readers understand the task, see how the models behave, and notice whether spurious failures are common.

(Open-ended red-teaming exploration is also fine; there's no great way to report its results. If the red-teamers write a report, ideally the company would quote its summary, redacted if appropriate.)
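For concreteness, here is one way the report's core metadata could be structured. This is only an illustrative sketch, not a standard format; every field name below is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class EvalReport:
    """Hypothetical sketch of the metadata an eval report could pin down."""
    eval_name: str        # e.g. a cyber or bio uplift eval
    model_version: str    # the exact model version/checkpoint evaluated
    elicitation: str      # prompting, scaffolding, tools, and fine-tuning used
    scoring: str          # how raw outcomes are turned into reported scores
    score: float          # headline result
    # A few randomly chosen transcripts (omitted if sensitive, e.g. for bio):
    sample_trajectories: list[str] = field(default_factory=list)
```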

Share info on evals

Share enough information that external observers can tell how high-quality and difficult the eval is, and thus how to interpret the results. Publish the methodology and ideally the tasks.[✲]

Open-source

Companies can run open-source external evals.

Companies can also have third-party evaluators run evals, including Apollo's Scheming reasoning evaluations.

Interpretation

Interpret eval results well and explain the interpretation. How do results translate to real-world capabilities? How do they translate to risks and required safeguards or safety cases? What eval results would indicate that the model has dangerous capabilities; what results would make the company uncertain? Why? Provide context, such as expert human performance.

Ideally, preregister the interpretation by publishing it in advance of the results.

Ideally, forecast eval results as a function of time (and optionally effective compute), and publish the forecasts.
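One way to do this, sketched below, is to fit a simple curve to past scores against log effective compute and extrapolate. The sigmoidal form and all numbers are assumptions made for illustration, not a method this page prescribes.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical past results: (log10 effective compute, eval score in [0, 1]).
log_compute = np.array([23.0, 23.5, 24.0, 24.5, 25.0])
scores = np.array([0.05, 0.08, 0.15, 0.30, 0.48])

def logistic(x, midpoint, steepness):
    """Assumed sigmoidal relationship between log effective compute and score."""
    return 1.0 / (1.0 + np.exp(-steepness * (x - midpoint)))

# Fit the curve to past results, then extrapolate to a future training run.
params, _ = curve_fit(logistic, log_compute, scores, p0=[25.0, 1.0])
forecast = logistic(26.0, *params)
print(f"Forecast score at 1e26 effective compute: {forecast:.2f}")
```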

Elicitation

Do good elicitation. Publish details or otherwise demonstrate that your elicitation is good.

Aspects of good elicitation:

Using easier evals or weak elicitation is sometimes fine, if done right.[✲]

(The ideal elicitation quality and safety buffer depend on the threat: misuse during intended deployment vs. the model being stolen, leaked, or escaping. In the former case, they also depend on users' depth of access and whether post-deployment correction is possible.)

(Companies may keep elicitation techniques secret to preserve their competitive advantage, though sharing such information seems fine in terms of safety. For now this is moot: companies' elicitation in dangerous-capability evals seems quite basic, not relying on secret powerful techniques.)
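For illustration, "basic" elicitation might amount to little more than a task-specific prompt plus best-of-k sampling, as in the sketch below. `query_model` and `score_attempt` are hypothetical stand-ins for the model API and the eval's grader; stronger elicitation would add fine-tuning, tool use, and agent scaffolding.

```python
# Hypothetical sketch of basic elicitation: best-of-k sampling with a helpful prompt.

def query_model(system_prompt: str, task: str) -> str:
    """Stand-in for a call to the model being evaluated."""
    raise NotImplementedError

def score_attempt(task: str, attempt: str) -> float:
    """Stand-in for the eval's grader (automated or human)."""
    raise NotImplementedError

def best_of_k(task: str, k: int = 16) -> float:
    """Sample k attempts and keep the best score: a crude lower bound on capability."""
    system_prompt = "You are an expert. Think step by step before answering."
    attempts = [query_model(system_prompt, task) for _ in range(k)]
    return max(score_attempt(task, a) for a in attempts)
```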

Accountability for eval results

The company could have an external auditor observe the evaluation process, read the eval report, and publicly comment on issues with the process or the report.

The company could also have load-bearing evals performed by external evaluators.

External evaluators

When sharing with external evaluators, do so well.

Ideally offer to share deep access with external evaluators including UK AISI, CAISI, METR, and Apollo. This does not need to be pre-deployment.

Accountability for interpretation

The company could get accountability for whether the reported evals are high-quality and support its claims about safety by having external experts publicly comment on the reported evals, their implications, and the company's claims. This is an unusual kind of action but not actually difficult: the external experts just need a copy of the eval report (and the company needs to make the report contain the information necessary for interpretation). The company could also share private information with the external experts to inform their assessment.

(Have good evals)

Actually measure the model's level of the particular capability in question (as opposed to accidentally measuring a different capability, or including so many impossible or incorrectly specified tasks that even a perfect model would get only a moderate score), and make sure that capability is actually important or relevant to the threat model.
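As a toy illustration of the impossible-tasks failure mode (all numbers made up):

```python
# Toy numbers: how broken tasks cap the maximum achievable score.
total_tasks = 100
broken_tasks = 30                         # impossible or incorrectly specified
ceiling = (total_tasks - broken_tasks) / total_tasks
print(f"Best score even a perfect model can get: {ceiling:.0%}")  # 70%

# A model scoring 55% overall is actually solving 55/70 ≈ 79% of the solvable tasks,
# so the raw score understates its capability.
```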

Notes

Evaluating for capabilities that enable misuse is nice. I'm particularly concerned about risks from scheming during internal deployment, relative to risks from misuse, but the relevant capabilities are less clear and may be harder to measure.

Regardless, evals only matter if they help lead to adequate responses when necessary.

This page is about capability evals. Related but distinct remarks apply to misalignment propensity evals. Determining whether a mitigated model is safe to deploy in a particular way generally requires determining whether the mitigations are very robust; this page is not about that either.

This page is about all powerful models, even if they're not externally deployed, since risks arise from training powerful models and from deploying them internally. Ideally companies would share information on internal model use in addition to capabilities.