These four pages are new and experimental. Please give me feedback (using the button in the bottom-right) on issues or on what else I should focus on here.

Intro to evals

AI companies claim that their models don't have dangerous capabilities on the basis of tests called model evals. Two examples of evals are the Virology Capabilities Test, a set of 300 very difficult virology questions, and SWE-bench Verified, a set of 500 real-world coding tasks.

Distinguish two aspects of how a model behaves when asked to do a task:

  • Capability: how well the model can do the thing if it tries.
  • Propensity: how likely the model is to try.

Capability is more important; I focus on it.[✲]

If everything goes right, then an eval result tells you how capable the AI is on that kind of task. From this, you can infer what risks it presents and what safety measures are appropriate.

Unfortunately, there are many ways things can go wrong, and evals often substantially underestimate a model's capabilities:

  • Questions are impossible because they omit necessary information, or the system for checking answers is broken.
  • The model solves the problem but formats its answer incorrectly and is marked wrong.
  • The model isn't given crucial tools.
  • The question is asked in a way that causes the model to not try hard, to give up too quickly, or to refuse to try.

Sophisticated users in the real world won't have these issues: they'll give the model tools, notice and respond if it often gets stuck for a dumb reason, and fix other spurious failures.[✲]
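To make the formatting failure mode concrete, here is a minimal hypothetical sketch (not taken from any real benchmark's grading code) of how a strict answer checker can score a substantively correct answer as wrong, while a slightly more lenient checker credits it:

```python
# Hypothetical graders, for illustration only.
def grade_strict(model_output: str, expected: str) -> bool:
    # Passes only if the output is exactly the expected string.
    return model_output == expected

def grade_lenient(model_output: str, expected: str) -> bool:
    # Normalizes case and whitespace, and accepts the expected answer
    # anywhere in the output.
    return expected.strip().lower() in model_output.strip().lower()

output = "Answer: 42"  # the model solved the problem...
assert not grade_strict(output, "42")  # ...but strict grading scores it 0
assert grade_lenient(output, "42")     # lenient grading gives credit
```

An eval built on the strict grader would report the model as less capable than it is, even though nothing about the model's reasoning failed.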

Besides doing the evals well, using evals to determine that a model is safe also requires correctly determining what capabilities would be dangerous, correctly interpreting eval results, and responding well if the evals indicate danger. A serious mistake in any single part of the process can invalidate the whole plan.

On this site, I consider five categories of evals: biology[✲] capabilities, offensive cyber capabilities, AI R&D capabilities, scheming capabilities, and misalignment propensity. This breakdown and focus are standard among the AI companies.