Age of LLM introduces a turn-based 1v1 benchmark in which two LLMs compete on a 13x7 grid under fog of war, with full diplomacy and strict JSON reliability rules. The findings show that the nuclear rush dominates, that diplomacy is prolific but rarely succeeds, and that illegal actions reveal belief-tracking errors; reliability is only weakly linked to victory. Because the corpus is small and unbalanced, the results offer a preliminary view of LLM reasoning under adversarial uncertainty.
Age of LLM: Benchmark for LLM Reasoning and Diplomacy