Metanym Game: Self-Contained LLM Benchmark for Structural Intelligence

The Metanym Game is a contamination-resistant benchmark for LLMs that measures structural intelligence through analogies created on the fly. A singular value decomposition of evaluator ratings reveals both generation and judging competence, and factual accuracy correlates strongly with GPQA Diamond (r = 0.92). Judging is the rarer skill: top generators are only average judges, while top judges produce mid-tier outputs. The strongest models earn seats on a council that rates itself and governs the benchmark.
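To make the decomposition step concrete, here is a minimal sketch of how an SVD can split a ratings matrix into a per-generator factor and a per-judge factor. It is an illustration under stated assumptions, not the benchmark's actual pipeline: the matrix layout (rows are judges, columns are generators), the row-centering step, the reading of judge loadings as "discrimination", the function name `decompose_ratings`, and the toy numbers are all hypothetical.

```python
import numpy as np


def decompose_ratings(ratings):
    """Rank-1 SVD view of a ratings matrix (hypothetical layout).

    ratings[i, j] is the score judge i gave to generator j.
    Returns (judge_loading, generator_loading, explained_variance_ratio).
    """
    ratings = np.asarray(ratings, dtype=float)

    # Assumption: subtract each judge's mean so that lenient and harsh
    # judges become comparable before factoring.
    centered = ratings - ratings.mean(axis=1, keepdims=True)

    # Thin SVD: centered = u @ diag(s) @ vt
    u, s, vt = np.linalg.svd(centered, full_matrices=False)

    judge_loading = u[:, 0]      # how strongly each judge follows the main axis
    generator_loading = vt[0]    # where each generator sits on the main axis

    # Singular vectors are only defined up to sign; orient them so that
    # a higher generator loading means a higher average rating.
    if np.dot(generator_loading, ratings.mean(axis=0)) < 0:
        judge_loading = -judge_loading
        generator_loading = -generator_loading

    # Share of the centered variance captured by the first component.
    explained = s[0] ** 2 / np.sum(s ** 2)
    return judge_loading, generator_loading, explained


# Toy data, invented for illustration only: 4 judges x 4 generators.
# The last judge gives nearly flat scores, i.e. discriminates poorly.
ratings = [
    [9, 7, 5, 3],
    [8, 7, 5, 4],
    [9, 6, 5, 2],
    [6, 6, 5, 5],
]

judge_skill, gen_quality, explained = decompose_ratings(ratings)
print("generator ranking (best first):", np.argsort(-gen_quality))
print("judge loadings:", np.round(judge_skill, 3))
print("variance explained by first component:", round(float(explained), 3))
```

In this sketch a judge whose scores barely vary across generators receives a small loading, which is one possible way to read "judging competence" off the decomposition. Whether the benchmark centers the ratings, or uses more than one component, is not stated in the text above.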