OpenAI

Company information

Company's most powerful model

o3

Deep Research released Feb 2, 2025; o3 released Apr 16; Codex released May 16; o3-pro released Jun 10

Deep Research system card published Feb 25, 2025; o3 system card published Apr 16; Codex addendum published May 16[✲]

Training and internal deployment: dates not reported

This page is about o3 and the major systems based on it: Deep Research, Codex, and o3-pro. I focus on the o3 system card, since Deep Research's came earlier and OpenAI didn't run Preparedness evals for Codex or o3-pro; OpenAI's justification for skipping those evals is that the systems are based on the o3 model.

OpenAI says the models don't have dangerous capabilities. It does some good evals. But the models perform well enough that it's not clear why OpenAI reaches that conclusion, and OpenAI doesn't say why it believes this or what would change its mind. This is particularly egregious in bio, where the models match or beat human experts on every eval with a human baseline and nearly saturate the one eval without one. OpenAI offers no explanation for why it believes the models can't meaningfully uplift novices.

Elicitation checklist: 🟡 🟡
Accountability checklist:

o3 evaluation categories

Click on a category to see my analysis and the relevant part of the company's eval report.

Chem & bio

OpenAI says the models don't meaningfully uplift novices in creating biothreats. But it provides no justification for this claim, and its evals suggest that the models are more capable than human experts.

Cyber

OpenAI says the models don't have dangerous cyber capabilities. It doesn't really say why it believes they're safe or what would change its mind. It does reasonable evals, but the models do well on the most challenging CTFs.

AI R&D

OpenAI says the models aren't as helpful as a strong research assistant. Some of its evals do suggest that, but the models might be substantially underelicited and might already be close to that level, and it's not clear what would change OpenAI's mind.

Scheming capabilities

OpenAI's bottom line is "o3 and o4-mini are capable of in-context scheming and strategic deception, although Apollo Research believes that it is unlikely that o3 or o4-mini would cause catastrophic harm due to scheming."

Misalignment propensity

OpenAI says Apollo "found that OpenAI o3 and o4-mini score lower than o1 on in-context scheming evaluations, but they still exhibit deceptive tendencies against their developers and users, such as sandbagging evaluations or sabotaging tasks when they have plausible deniability and it helps to further their goals."

o3 safety case categories

This section is new and experimental. Click on a category to see my analysis. Preventing misuse via API (i.e., when the user doesn't control the model weights) is relatively straightforward, especially for bio; scheming risk prevention will eventually be important and improving it from current levels is tractable; security will eventually be important and we basically know what to do but it's very costly. I'm interested in feedback, especially on whether this content is helpful to you or what questions you wish I answered here instead.

Misuse (via API) prevention

OpenAI says the models don't have dangerous capabilities, so safeguards are unnecessary. But eval results seem to contradict that claim.

OpenAI's High threshold in bio or cyber capabilities triggers its High misuse standard. The thresholds are fine in the abstract, but that's inadequate given that OpenAI seems to interpret its eval results poorly. The misuse standards are supposed to "sufficiently minimize the associated risk of severe harm"; OpenAI doesn't say how it will tell whether a particular set of safeguards suffices.

Scheming risk prevention

OpenAI says the models don't have dangerous capabilities, so safeguards are unnecessary.

It's unclear which of OpenAI's capability thresholds trigger its misalignment standards. Misalignment standards are supposed to "sufficiently minimize the risk associated with a misaligned model circumventing human control and oversight and executing severe harms." OpenAI mentions some good paths to safety and some confused paths, so overall the standard is concerning.

Security

OpenAI says the models don't have dangerous capabilities, so safeguards are unnecessary. But eval results seem to contradict that claim.

OpenAI's High threshold in bio, cyber, or AI R&D capabilities triggers its High security standard. The thresholds are fine in the abstract, but that's inadequate given that OpenAI seems to interpret its eval results poorly. Crucially, the High standard is vague and inadequate; it will probably leave OpenAI's model weights vulnerable to many actors.