Extended Results
For models in self-play, pass@1
Multi-module missions (10 missions)
Column definitions: Missions solved = full multi-module missions defused (pass@1); Modules solved = share of attempted modules solved; Any module solved = missions where at least one module was solved; Avg. strikes = mean strikes incurred per mission (3 strikes = explosion); Timeouts = missions that ran out the live countdown.

| Model | Missions solved | Modules solved | Any module solved | Avg. strikes | Timeouts |
|---|---|---|---|---|---|
| | 0% | 15% | 50% | 2.3 / 3 | 40% |
| | 0% | 12% | 30% | 1.2 / 3 | 90% |
| | 0% | 9% | 30% | 1.7 / 3 | 50% |
| | 0% | 3% | 10% | 1.4 / 3 | 70% |
| | 0% | 6% | 20% | 0.8 / 3 | 80% |
| Human players | 25% | 60% | 93% | 1.1 / 3 | 55% |
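
The five columns of the multi-module table can be derived from per-mission logs. The sketch below is one plausible way to do that, not the benchmark's actual scoring code: the `MissionRun` record, its field names, the `summarize` helper, and the rule that a mission counts as defused only when every module is solved with fewer than 3 strikes and no timeout are all assumptions made for illustration. It also assumes that "share of attempted modules solved" is pooled across missions rather than averaged per mission. The example runs are made-up values, not benchmark results.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class MissionRun:
    """One pass@1 attempt at a multi-module mission (hypothetical record layout)."""
    modules_total: int      # modules on the bomb
    modules_attempted: int  # modules the agents actually worked on
    modules_solved: int     # modules solved before the mission ended
    strikes: int            # 0..3; 3 strikes = explosion
    timed_out: bool         # the live countdown ran out

    @property
    def defused(self) -> bool:
        # Assumed definition: all modules solved, no explosion, no timeout.
        return (
            self.modules_solved == self.modules_total
            and self.strikes < 3
            and not self.timed_out
        )


def summarize(runs: List[MissionRun]) -> Dict[str, float]:
    """Aggregate per-mission records into the table's five columns."""
    n = len(runs)
    attempted = sum(r.modules_attempted for r in runs)
    solved = sum(r.modules_solved for r in runs)
    return {
        "missions_solved": sum(r.defused for r in runs) / n,
        # Pooled over all missions (an assumption; a per-mission mean is also possible).
        "modules_solved": solved / attempted if attempted else 0.0,
        "any_module_solved": sum(r.modules_solved > 0 for r in runs) / n,
        "avg_strikes": sum(r.strikes for r in runs) / n,
        "timeouts": sum(r.timed_out for r in runs) / n,
    }


# Made-up illustrative runs, not benchmark data.
example_runs = [
    MissionRun(modules_total=3, modules_attempted=3, modules_solved=3, strikes=1, timed_out=False),
    MissionRun(modules_total=3, modules_attempted=2, modules_solved=1, strikes=3, timed_out=False),
    MissionRun(modules_total=3, modules_attempted=1, modules_solved=0, strikes=0, timed_out=True),
    MissionRun(modules_total=3, modules_attempted=2, modules_solved=0, strikes=2, timed_out=True),
]
summary = summarize(example_runs)
```

Under these assumptions, a row can show 0% missions solved while still showing a nonzero share of modules solved, because the mission-level column requires every module on the bomb to be solved in a single attempt.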
Single-module missions (10 missions per module type)
Score by model × task — each column is a bomb module the agents must defuse; cells show the share of attempts solved (pass@1). New modules added to the benchmark appear here automatically.
| Model | | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | 30% | 20% | 40% | 0% | 0% | 0% | 0% | 0% | 0% | 10% | 0% |
| | 40% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% |
| | 30% | 10% | 30% | 0% | 20% | 10% | 10% | 0% | 0% | 20% | 0% |
| | 40% | 20% | 20% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% |
| | 30% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% |
Multi-module missions (10 missions)
| Model | Missions solved | Modules solved | Any module solved | Avg. strikes | Timeouts |
|---|---|---|---|---|---|
| | 10% | 29% | 50% | 2.2 / 3 | 30% |
| | 10% | 21% | 50% | 2.5 / 3 | 20% |
| | 0% | 15% | 40% | 1.6 / 3 | 60% |
| | 0% | 12% | 40% | 1.8 / 3 | 70% |
| | 0% | 12% | 30% | 1.4 / 3 | 60% |
Contamination Checks
No partner, no manual
Ablations surfacing previous exposure to the game during training. Results indicate the extent to which models can leverage parametric knowledge.
| Model | Single Agent | Expert VQA |
|---|---|---|
| | 26% | 22% |
| | 24% | 20% |
| | 12% | 36% |
| | 11% | 22% |
| | 12% | 10% |
| Random baseline | 3% | 21% |
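
One way to read the contamination table is to subtract the random baseline from each row, giving a margin in percentage points. The snippet below is a small sketch of that arithmetic using the values from the table, listed in table order because the source does not show model names; the `margins` helper and the variable names are illustrative, and treating the margin as a proxy for parametric knowledge is an interpretation rather than a stated method.

```python
from typing import List, Tuple

# (Single Agent %, Expert VQA %) per model row, copied from the table in row order.
ROWS: List[Tuple[int, int]] = [(26, 22), (24, 20), (12, 36), (11, 22), (12, 10)]
# Random baseline row: (Single Agent %, Expert VQA %).
BASELINE: Tuple[int, int] = (3, 21)


def margins(rows: List[Tuple[int, int]], baseline: Tuple[int, int]) -> List[Tuple[int, int]]:
    """Return each row's score minus the random baseline, in percentage points."""
    return [(single - baseline[0], vqa - baseline[1]) for single, vqa in rows]


for i, (single, vqa) in enumerate(margins(ROWS, BASELINE), start=1):
    print(f"row {i}: Single Agent {single:+d} pts, Expert VQA {vqa:+d} pts")
```

Every row clears the Single Agent baseline by at least 8 points, while on Expert VQA only the third row is clearly above the baseline and the fifth row falls 11 points below it.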
