Related work

Evaluating AI companies' evals

Nothing systematic exists, but see:

Ryan Greenblatt's comment on CBRN capabilities in Spring 2025 (2025)
Luca Righetti's "OpenAI's CBRN tests seem unclear" (2024) and "o1's new CBRN report looks (a fair bit) better" (2024)
Anson Ho and Arden Berg's "Do the biorisk evaluations of AI labs actually measure the risk of developing bioweapons?" (2025)
Zvi Mowshowitz's "The o1 System Card Is Not About o1" (2024)

Best practices

My page
Guidelines for capability elicitation (METR 2024)
Challenges in evaluating AI systems (Anthropic 2023)
Black-Box Access is Insufficient for Rigorous AI Audits (Casper et al. 2024)
A Safe Harbor for AI Evaluation and Red Teaming (Longpre et al. 2024)
Dangerous capability tests should be harder (Righetti 2024)