A two-agent benchmark: one model sees the bomb, one reads the manual. They must talk to defuse it.

GPTNT

Can two AI agents talk each other through defusing a bomb?

GPTNT is a benchmark built on the cooperative game Keep Talking and Nobody Explodes, where two agents must coordinate to defuse procedurally generated bombs against a live countdown. One sees the bomb but not the manual; the other holds the manual but can't see the bomb. Neither can succeed alone — only effective, real-time, asynchronous communication defuses the bomb.

Not one of the state-of-the-art models we test defuses a single bomb in real time — a bar that human players clear.

Amit Parekh*, Sabrina McCallum*, Kareem Al-Hasan*, Malvina Nikandrou, Alessandro Suglia, Ioannis Konstas

Heriot-Watt University · University of Edinburgh

* Equal contribution

Leaderboard

For models in self-play, pass@1. See the extended results for the full breakdown.

Replay

Replays of actual games played by the models we tested, showing the Defuser's view of the bomb.

FAQ

about the benchmark
What is GPTNT?
GPTNT is a benchmark for real-time collaboration between multimodal models, built on the cooperative game Keep Talking and Nobody Explodes. Two agents must coordinate to defuse procedurally generated bomb puzzles against a live countdown. One agent is the Defuser: it has access to the bomb but not the instructions for defusing it. The other agent is the Expert: it holds the instruction manual but cannot see or manipulate the bomb.
Why is this hard for AI agents?
GPTNT combines conditions that are usually studied in isolation: asymmetric information, real-time asynchronous action, visual grounding, procedural reasoning, and sustained multi-turn communication. Current models break down when tracking state across turns, acting within the time budget, handling ambiguous descriptions, and recovering from mistakes.
What do you release, and do I need the game?
We release everything needed to run the benchmark: the game mod and microservice framework, the fixed suite of mission configurations, the processed manual, the role-specific system prompts, and the single-agent diagnostic evaluations. You do need your own copy of Keep Talking and Nobody Explodes ($14.99, Windows/macOS/Linux). We do not distribute the game or any of its source, as that would violate its license.
What are the "async" and "sync" modes?
In async mode, the bomb clock advances in real time and never pauses while a model generates, so every generated token consumes game time. The two agents run in parallel, so the Expert can reason and speak while the Defuser acts. Sync mode freezes the clock during generation and advances it by a fixed increment per turn, separating the ability to reason accurately from the ability to reason efficiently. It also lowers the barrier to entry by matching the turn-based interaction that current models assume.
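The difference in how the two modes charge time can be sketched in a few lines. This is a simplified illustration, not the benchmark's implementation: the `BombClock` class, the `charge_turn` method, and the 10-second `sync_increment_s` are all hypothetical, and the sketch charges one turn at a time rather than modelling two agents running in parallel.

```python
from dataclasses import dataclass


@dataclass
class BombClock:
    """Hypothetical countdown clock illustrating async vs. sync charging."""

    remaining_s: float
    mode: str  # "async" or "sync"
    sync_increment_s: float = 10.0  # assumed fixed per-turn cost in sync mode

    def charge_turn(self, generation_s: float) -> float:
        """Deduct the cost of one model turn and return the time left."""
        if self.mode == "async":
            # Real time: every second spent generating comes off the clock.
            cost = generation_s
        elif self.mode == "sync":
            # Clock is frozen during generation; each turn costs a fixed amount.
            cost = self.sync_increment_s
        else:
            raise ValueError(f"unknown mode: {self.mode!r}")
        self.remaining_s = max(0.0, self.remaining_s - cost)
        return self.remaining_s


# A slow, 42-second turn on a five-minute bomb.
async_clock = BombClock(remaining_s=300.0, mode="async")
sync_clock = BombClock(remaining_s=300.0, mode="sync")
print(async_clock.charge_turn(42.0))  # pays for the full generation time
print(sync_clock.charge_turn(42.0))   # pays only the fixed increment
```

Under these assumptions, a model that reasons at length loses the full generation time in async mode but only the fixed increment in sync mode, which is why sync results isolate accuracy from speed.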
What's the difference between a strikeout and a timeout?
A failed mission ends in one of two terminal states: a strikeout, when the Defuser makes three mistakes and accumulates three strikes, or a timeout, when the timer expires before all modules are solved. These are diagnostically different failures: a model can be careful but slow (timeout) or fast but reckless (strikeout).
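As a rough sketch, the terminal states described above could be classified as follows. The `Outcome` enum, the `classify` function, and the order in which the conditions are checked are assumptions made for illustration; only the three-strike limit comes from the description above.

```python
from enum import Enum


class Outcome(Enum):
    DEFUSED = "defused"
    STRIKEOUT = "strikeout"
    TIMEOUT = "timeout"
    IN_PROGRESS = "in_progress"


MAX_STRIKES = 3  # three strikes end the mission


def classify(strikes: int, time_left_s: float,
             modules_solved: int, modules_total: int) -> Outcome:
    """Map a mission snapshot to an outcome (check order is an assumption)."""
    if modules_solved >= modules_total:
        return Outcome.DEFUSED
    if strikes >= MAX_STRIKES:
        return Outcome.STRIKEOUT  # fast but reckless
    if time_left_s <= 0:
        return Outcome.TIMEOUT  # careful but slow
    return Outcome.IN_PROGRESS


print(classify(strikes=3, time_left_s=120.0, modules_solved=2, modules_total=5))
print(classify(strikes=0, time_left_s=0.0, modules_solved=4, modules_total=5))
```

Reporting the two failure counts separately, rather than a single failure rate, keeps the careful-but-slow and fast-but-reckless profiles distinguishable.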
How do you separate real collaboration from memorised solutions?
Because the game predates current models, parts of its manual and puzzle logic may appear in training data. As a contamination check, we run dedicated single-agent evaluations without providing the manual to expose what a model already knows.
Will the benchmark get solved and retired?
GPTNT is designed to stay ahead of improving models. New missions can be procedurally generated and reconfigured by varying the time limit and the count, type, and placement of modules. New puzzle types from the game's active modding community can be added without rebuilding the framework, keeping the task space ahead of any fixed training distribution.
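To make the reconfiguration idea concrete, here is a minimal sketch of seeded mission generation that varies the time limit and the count, type, and placement of modules. Everything in it is hypothetical: the `MissionConfig` fields, the `generate_mission` function, the module list, and the default slot count are illustrative and do not reflect the released configuration format.

```python
import random
from dataclasses import dataclass

# Illustrative module types; the benchmark's actual module set may differ.
MODULE_TYPES = ["wires", "button", "keypads", "memory"]


@dataclass(frozen=True)
class MissionConfig:
    seed: int
    time_limit_s: int
    modules: tuple  # (slot_index, module_type) pairs


def generate_mission(seed: int, time_limit_s: int,
                     n_modules: int, n_slots: int = 11) -> MissionConfig:
    """Build a reproducible mission from a seed (n_slots is an assumed default)."""
    rng = random.Random(seed)
    slots = rng.sample(range(n_slots), n_modules)  # distinct placements
    types = [rng.choice(MODULE_TYPES) for _ in range(n_modules)]
    return MissionConfig(seed, time_limit_s, tuple(sorted(zip(slots, types))))


mission = generate_mission(seed=7, time_limit_s=300, n_modules=3)
print(mission)
```

Fixing the seed keeps a mission reproducible across runs, while a new seed or a different time limit and module count yields a fresh mission without changing the surrounding framework.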

Citation

cite this benchmark
bibtex
@misc{gptnt2026,
  title  = {GPTNT: Keep Talking and Nobody Explodes, for AI},
  author = {GPTNT Team},
  year   = {2026},
  note   = {A two-agent benchmark for real-time multimodal collaboration},
  url    = {https://gptnt.ai}
}