GPTNT — Extended Results

Extended Results

For models in self-play, pass@1

Multi-module missions (10 missions)

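Column definitions: Missions solved is the share of full multi-module missions defused (pass@1). Modules solved is the share of attempted modules solved. Any module solved is the share of missions where at least one module was solved. Avg. strikes is the mean number of strikes incurred per mission (3 strikes = explosion). Timeouts is the share of missions that ran out the live countdown.

Model	Missions solved	Modules solved	Any module solved	Avg. strikes	Timeouts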
Sonnet 4.6	0%	15%	50%	2.3 / 3	40%
GPT-5.2	0%	12%	30%	1.2 / 3	90%
Gemini 3 Flash	0%	9%	30%	1.7 / 3	50%
InternVL 3.5 (38B)	0%	3%	10%	1.4 / 3	70%
Qwen 3.5 (27B)	0%	6%	20%	0.8 / 3	80%
Human players	25%	60%	93%	1.1 / 3	55%
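
The column definitions for this table can be read as simple aggregates over per-mission records. The sketch below is one plausible reading, not the benchmark's actual scoring code: the `MissionResult` record, its field names, and the toy data are all hypothetical, and it assumes "Modules solved" is pooled over all attempted modules rather than averaged per mission.

```python
from dataclasses import dataclass

STRIKE_LIMIT = 3  # from the column definitions: 3 strikes = explosion


@dataclass
class MissionResult:
    """Hypothetical per-mission record; field names are illustrative only."""
    modules_attempted: int
    modules_solved: int
    strikes: int
    timed_out: bool

    @property
    def defused(self) -> bool:
        # Assumption: a mission counts as defused only if every attempted
        # module was solved with no explosion and no timeout.
        return (
            self.modules_solved == self.modules_attempted
            and self.strikes < STRIKE_LIMIT
            and not self.timed_out
        )


def summarize(results):
    """Aggregate mission records into the five table columns (as fractions)."""
    n = len(results)
    attempted = sum(r.modules_attempted for r in results)
    solved = sum(r.modules_solved for r in results)
    return {
        "missions_solved": sum(r.defused for r in results) / n,
        # Assumption: pooled over all attempted modules, not a per-mission mean.
        "modules_solved": solved / attempted,
        "any_module_solved": sum(r.modules_solved > 0 for r in results) / n,
        "avg_strikes": sum(r.strikes for r in results) / n,
        "timeouts": sum(r.timed_out for r in results) / n,
    }


# Toy records for illustration only; these are not benchmark data.
toy = [
    MissionResult(3, 3, 1, False),  # defused
    MissionResult(3, 1, 3, False),  # exploded on the third strike
    MissionResult(2, 0, 0, True),   # ran out the countdown
    MissionResult(3, 1, 2, True),   # ran out the countdown
]
summary = summarize(toy)
```

Under this reading, a model can have a high "Any module solved" share and still score 0% on "Missions solved", which matches the pattern in the rows above.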

Single-module missions (10 missions per module type)

Score by model × task. Each column is a bomb module the agents must defuse; cells show the share of attempts solved (pass@1). New modules added to the benchmark appear here automatically.

Model	Wires	Button	Keypad	Simon Says	Who's On First	Memory	Morse Code	Complicated Wires	Wire Sequence	Maze	Passwords
Sonnet 4.6	30%	20%	40%	0%	0%	0%	0%	0%	0%	10%	0%
GPT-5.2	40%	0%	0%	0%	0%	0%	0%	0%	0%	0%	0%
Gemini 3 Flash	30%	10%	30%	0%	20%	10%	10%	0%	0%	20%	0%
InternVL 3.5 (38B)	40%	20%	20%	0%	0%	0%	0%	0%	0%	0%	0%
Qwen 3.5 (27B)	30%	0%	0%	0%	0%	0%	0%	0%	0%	0%	0%

Multi-module missions (10 missions)

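Column definitions: Missions solved is the share of full multi-module missions defused (pass@1). Modules solved is the share of attempted modules solved. Any module solved is the share of missions where at least one module was solved. Avg. strikes is the mean number of strikes incurred per mission (3 strikes = explosion). Timeouts is the share of missions that ran out the live countdown.

Model	Missions solved	Modules solved	Any module solved	Avg. strikes	Timeouts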
Sonnet 4.6	10%	29%	50%	2.2 / 3	30%
GPT-5.2	10%	21%	50%	2.5 / 3	20%
Gemini 3 Flash	0%	15%	40%	1.6 / 3	60%
InternVL 3.5 (38B)	0%	12%	40%	1.8 / 3	70%
Qwen 3.5 (27B)	0%	12%	30%	1.4 / 3	60%

Single-module missions (10 missions per module type)

Score by model × task. Each column is a bomb module the agents must defuse; cells show the share of attempts solved (pass@1). New modules added to the benchmark appear here automatically.

Model	Wires	Button	Keypad	Simon Says	Who's On First	Memory	Morse Code	Complicated Wires	Wire Sequence	Maze	Passwords
Sonnet 4.6	30%	20%	40%	0%	0%	0%	0%	0%	0%	10%	0%
GPT-5.2	40%	0%	0%	0%	0%	0%	0%	0%	0%	0%	0%
Gemini 3 Flash	30%	10%	30%	0%	20%	10%	10%	0%	0%	20%	0%
InternVL 3.5 (38B)	40%	20%	20%	0%	0%	0%	0%	0%	0%	0%	0%
Qwen 3.5 (27B)	30%	0%	0%	0%	0%	0%	0%	0%	0%	0%	0%

Contamination Checks

No partner, no manual

These ablations probe for prior exposure to the game during training. With no partner and no manual, a model can only draw on what it already knows, so the results indicate the extent to which models can leverage parametric knowledge.

Model	Single Agent	Expert VQA
Sonnet 4.6	26%	22%
GPT-5.2	24%	20%
Gemini 3 Flash	12%	36%
InternVL 3.5 (38B)	11%	22%
Qwen 3.5 (27B)	12%	10%
Random baseline	3%	21%
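
One way to read this table is to subtract the random baseline from each score. The sketch below does that in percentage points using the values above; the function name is illustrative. It is a plain difference, not a significance test, and the number of trials behind each cell is not stated here.

```python
# (Single Agent, Expert VQA) in percent, transcribed from the table above.
RESULTS = {
    "Sonnet 4.6":         (26, 22),
    "GPT-5.2":            (24, 20),
    "Gemini 3 Flash":     (12, 36),
    "InternVL 3.5 (38B)": (11, 22),
    "Qwen 3.5 (27B)":     (12, 10),
}
RANDOM_BASELINE = (3, 21)


def margin_over_random(scores, baseline=RANDOM_BASELINE):
    """Per-column difference from the random baseline, in percentage points."""
    return tuple(s - b for s, b in zip(scores, baseline))


for model, scores in RESULTS.items():
    single, vqa = margin_over_random(scores)
    print(f"{model}: Single Agent {single:+d} pp, Expert VQA {vqa:+d} pp")
```

Read this way, every model sits above random in the Single Agent column, while in the Expert VQA column most models land within a couple of points of the 21% baseline.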