OpenAI
Company information
Company's most powerful model
o3
Deep Research released Feb 2, 2025; o3 released Apr 16; Codex released May 16; o3-pro released Jun 10
Deep Research system card published Feb 25, 2025; o3 system card published Apr 16; Codex addendum published May 16[✲]
Training and internal deployment: dates not reported
This page is about o3 and the major systems based on it: Deep Research, Codex, and o3-pro. I focus on the o3 system card, since the Deep Research system card came earlier and OpenAI didn't do Preparedness evals for Codex or o3-pro; its justification for skipping those evals is that the systems are based on the o3 model.
OpenAI says the models don't have dangerous capabilities. It does some good evals. But the models perform well enough that it's not clear why OpenAI believes this, and OpenAI doesn't explain its reasoning or say what would change its mind. This is particularly egregious in bio, where the models match or beat humans on every eval with a human comparison and nearly saturate the one eval without. OpenAI offers no explanation for why it believes the models can't meaningfully uplift novices.
o3 evaluation categories
Click on a category to see my analysis and the relevant part of the company's eval report.
Chem & bio
OpenAI says the models don't meaningfully uplift novices in creating biothreats. But it provides no justification for this claim, and its evals suggest that the models are more capable than human experts.
Cyber
OpenAI says the models don't have dangerous cyber capabilities. It doesn't really say why it believes they're safe or what would change its mind. It does reasonable evals, but the models do well on the most challenging CTFs.
AI R&D
OpenAI says the models aren't as helpful as a strong research assistant. Some of its evals do suggest that, but the models might be substantially underelicited and might already be close to that level, and it's not clear what would change OpenAI's mind.
Scheming capabilities
OpenAI's bottom line is "o3 and o4-mini are capable of in-context scheming and strategic deception, although Apollo Research believes that it is unlikely that o3 or o4-mini would cause catastrophic harm due to scheming."
Misalignment propensity
OpenAI says Apollo "found that OpenAI o3 and o4-mini score lower than o1 on in-context scheming evaluations, but they still exhibit deceptive tendencies against their developers and users, such as sandbagging evaluations or sabotaging tasks when they have plausible deniability and it helps to further their goals."