A two-agent benchmark: one model sees the bomb, one reads the manual. They must talk to defuse it.

Extended Results

For models in self-play, pass@1

Multi-module missions (10 missions)

Missions solved: Full multi-module missions defused (pass@1).
Modules solved: Share of attempted modules solved.
Any module solved: Missions where at least one module was solved.
Avg. strikes: Mean strikes incurred per mission (3 strikes = explosion).
Timeouts: Missions that ran out the live countdown.

| Model | Missions solved | Modules solved | Any module solved | Avg. strikes | Timeouts |
|---|---|---|---|---|---|
| Anthropic Sonnet 4.6 | 0% | 15% | 50% | 2.3 / 3 | 40% |
| OpenAI GPT-5.2 | 0% | 12% | 30% | 1.2 / 3 | 90% |
| Google Gemini 3 Flash | 0% | 9% | 30% | 1.7 / 3 | 50% |
| InternVL 3.5 (38B) | 0% | 3% | 10% | 1.4 / 3 | 70% |
| Qwen 3.5 (27B) | 0% | 6% | 20% | 0.8 / 3 | 80% |
| Human players | 25% | 60% | 93% | 1.1 / 3 | 55% |
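The five columns above can be derived from per-mission records. The following is a minimal sketch of that aggregation, not the benchmark's own scoring code: the `MissionResult` fields and the `summarize` helper are hypothetical names, the example records are made-up toy data, and pooling modules across missions (rather than averaging per-mission shares) is an assumption about how "share of attempted modules solved" is computed.

```python
from dataclasses import dataclass


@dataclass
class MissionResult:
    """One mission attempt (hypothetical record layout)."""
    defused: bool            # every module solved before explosion or timeout
    modules_attempted: int
    modules_solved: int
    strikes: int             # 3 strikes = explosion
    timed_out: bool          # the live countdown ran out


def summarize(results):
    """Aggregate per-mission records into the five leaderboard columns."""
    n = len(results)
    attempted = sum(r.modules_attempted for r in results)
    solved = sum(r.modules_solved for r in results)
    return {
        "missions_solved": sum(r.defused for r in results) / n,
        # Assumption: modules are pooled across missions before dividing.
        "modules_solved": solved / attempted if attempted else 0.0,
        "any_module_solved": sum(r.modules_solved > 0 for r in results) / n,
        "avg_strikes": sum(r.strikes for r in results) / n,
        "timeouts": sum(r.timed_out for r in results) / n,
    }


# Toy data for illustration only; these are not benchmark results.
example = [
    MissionResult(defused=False, modules_attempted=4, modules_solved=1,
                  strikes=3, timed_out=False),
    MissionResult(defused=False, modules_attempted=3, modules_solved=0,
                  strikes=1, timed_out=True),
]
summary = summarize(example)
print(summary)
```

Under this reading a mission can count toward "Any module solved" without counting toward "Missions solved", which is how a model can score 0% on the former column's stricter sibling while still solving a share of modules.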

Single-module missions (10 missions per module type)

Score by model × task: each column is a bomb module the agents must defuse; cells show the share of attempts solved (pass@1). New modules added to the benchmark appear here automatically.

| Model | Wires | Button | Keypad | Simon Says | Who's On First | Memory | Morse Code | Complicated Wires | Wire Sequence | Maze | Passwords |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Anthropic Sonnet 4.6 | 30% | 20% | 40% | 0% | 0% | 0% | 0% | 0% | 0% | 10% | 0% |
| OpenAI GPT-5.2 | 40% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% |
| Google Gemini 3 Flash | 30% | 10% | 30% | 0% | 20% | 10% | 10% | 0% | 0% | 20% | 0% |
| InternVL 3.5 (38B) | 40% | 20% | 20% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% |
| Qwen 3.5 (27B) | 30% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% |

Contamination Checks

No partner, no manual

These ablations probe for prior exposure to the game during training. The results indicate how far each model can get on parametric knowledge alone.

| Model | Single Agent | Expert VQA |
|---|---|---|
| Anthropic Sonnet 4.6 | 26% | 22% |
| OpenAI GPT-5.2 | 24% | 20% |
| Google Gemini 3 Flash | 12% | 36% |
| InternVL 3.5 (38B) | 11% | 22% |
| Qwen 3.5 (27B) | 12% | 10% |
| Random baseline | 3% | 21% |
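One way to read the contamination table is as a margin over the random baseline in each column. The sketch below computes that difference in percentage points from the figures reported above; the dictionary keys and the `margin_over_random` helper are hypothetical names. A raw difference is not a significance test, and this section does not state the sample sizes that such a test would need.

```python
# Scores in percent, copied from the contamination table above.
RANDOM_BASELINE = {"single_agent": 3, "expert_vqa": 21}

SCORES = {
    "Anthropic Sonnet 4.6": {"single_agent": 26, "expert_vqa": 22},
    "OpenAI GPT-5.2": {"single_agent": 24, "expert_vqa": 20},
    "Google Gemini 3 Flash": {"single_agent": 12, "expert_vqa": 36},
    "InternVL 3.5 (38B)": {"single_agent": 11, "expert_vqa": 22},
    "Qwen 3.5 (27B)": {"single_agent": 12, "expert_vqa": 10},
}


def margin_over_random(scores, baseline):
    """Return each model's score minus the random baseline, per column."""
    return {
        model: {task: value - baseline[task] for task, value in columns.items()}
        for model, columns in scores.items()
    }


margins = margin_over_random(SCORES, RANDOM_BASELINE)
for model, columns in margins.items():
    print(f"{model}: single agent {columns['single_agent']:+d} pts, "
          f"expert VQA {columns['expert_vqa']:+d} pts")
```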