Age of LLM: A Strategic 1v1 Benchmark for Reasoning, Diplomacy and Reliability

The authors introduce Age of LLM, a turn-based 1v1 benchmark in which two large language models compete on a 13x7 grid to destroy the enemy base under fog of war, with full diplomacy available. The engine is private and mitigates data contamination by using fresh random map seeds and opponents for each match.
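The setup described above can be illustrated with a minimal sketch. The engine is private, so everything here beyond the 13x7 grid size is an assumption made for illustration: the function names `new_match_seed` and `visible_cells`, the vision radius, and the square (Chebyshev-distance) visibility rule are all hypothetical and are not taken from the benchmark.

```python
import random

WIDTH, HEIGHT = 13, 7  # grid size stated in the text


def new_match_seed() -> int:
    """Draw a fresh map seed for one match (hypothetical helper).

    SystemRandom reads OS entropy, so seeds are not reproducible
    from a fixed starting state.
    """
    return random.SystemRandom().randrange(2**32)


def visible_cells(units, radius=2):
    """Return the set of cells a player can see under fog of war.

    Assumed rule: a cell is visible if it lies within Chebyshev
    distance `radius` of any friendly unit. The real engine's
    visibility rule is not described in the source.
    """
    seen = set()
    for ux, uy in units:
        for x in range(max(0, ux - radius), min(WIDTH, ux + radius + 1)):
            for y in range(max(0, uy - radius), min(HEIGHT, uy + radius + 1)):
                seen.add((x, y))
    return seen
```

Under this sketch, an action that targets a cell outside `visible_cells(...)` would be one way to produce the kind of fog error the benchmark counts as illegal.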

The benchmark evaluates 15 reasoning models across 54 matches and 5,258 actions.
Nuclear rushes dominate outcomes (78% on the rules-coherent sub-corpus), driven by mechanical launch rules rather than cognitive deterrence failure.
Military conquest is rare but faster (12.3 vs 18.9 turns), while diplomacy is prolific yet rarely consummated.
Approximately 58% of illegal actions are fog-of-war or game-state errors, so these errors serve as a measure of belief-tracking.
Reliability is only weakly associated with winning, and the corpus is too small to support a definitive ranking.
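The share of illegal actions attributable to fog or state errors can be computed with a few lines of code. The sketch below assumes each illegal action has already been labeled with a `category` string. The record layout and the labels `"fog"`, `"state"`, and `"syntax"` are hypothetical and do not come from the benchmark's trace format.

```python
from collections import Counter

# Categories treated as belief-tracking failures (assumed labels).
BELIEF_CATEGORIES = {"fog", "state"}


def belief_error_share(illegal_actions):
    """Fraction of illegal actions that are fog or state errors.

    `illegal_actions` is an iterable of dicts with a "category" key
    (hypothetical record layout). Returns 0.0 for an empty input.
    """
    counts = Counter(a["category"] for a in illegal_actions)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(counts[c] for c in BELIEF_CATEGORIES) / total


# Toy data for illustration only, not benchmark results.
sample = [
    {"category": "fog"},
    {"category": "state"},
    {"category": "syntax"},
    {"category": "fog"},
]
share = belief_error_share(sample)
```

A per-model version of this ratio is one plausible way to compare belief-tracking across models, although with a corpus of this size the differences may not be statistically meaningful.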

The turn-by-turn traces provide a lens into how LLMs reason under adversarial uncertainty, revealing aspects of belief-tracking and spontaneous deception.