The Complexity Ceiling Benchmark: A Multi-Domain Evaluation of Sequential Reasoning Under Depth Scaling

The Complexity Ceiling Benchmark (CCB) measures how language-model reasoning accuracy decays as the number of required sequential steps N increases, holding semantic content fixed while varying task depth from N=5 to N=50. The study finds geometric per-step decay in each of three distinct regimes: grounded spatial state-tracking, abstract symbolic pointer manipulation, and transitive relational inference.

Across 6,000 trials on five frontier and open-weight LLMs, the strongest models retained a success probability greater than 0.92 at N=50 for the first two regimes.
In transitive relational inference, every model collapsed by N=5, with the best model's 50%-success horizon limited to approximately 4.7 steps.
A trace-level metric (TFBC) indicates that 14.5% of correct answers were reached via incorrect intermediate reasoning.
Forced verbose state-tracking did not improve performance (McNemar p=1.000), and the mean step at which reasoning diverges predicts accuracy better than parameter count.
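
For readers unfamiliar with the paired comparison cited above, the following is a minimal sketch of an exact two-sided McNemar test. It assumes the exact binomial form of the test; the abstract does not say which variant was used. The function name and the discordant counts in the usage note are hypothetical, not the study's data.

```python
from math import comb


def mcnemar_exact_p(only_baseline_correct: int, only_verbose_correct: int) -> float:
    """Exact two-sided McNemar p-value from the two discordant-pair counts.

    Hypothetical helper: each argument counts paired trials in which exactly
    one of the two prompting conditions produced the correct answer.
    """
    n = only_baseline_correct + only_verbose_correct
    if n == 0:
        # No discordant pairs: the data give no evidence of a difference.
        return 1.0
    k = min(only_baseline_correct, only_verbose_correct)
    # Lower tail of Binomial(n, 0.5), doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)
```

Under this form, p = 1.000 arises when the discordant pairs split evenly, or when there are none. For instance, `mcnemar_exact_p(3, 3)` returns 1.0, while a lopsided split such as `mcnemar_exact_p(0, 5)` returns 0.0625.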

CCB and its geometric decay model reduce a model's long-horizon reasoning profile to one interpretable number per task family, providing a standardized method for assessing reasoning limits.
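
To make the reported numbers concrete, the sketch below assumes the simplest form of geometric decay, P(success at depth N) = p^N, with a single per-step accuracy p and no prefactor. The exact parameterization used by CCB is not given in this abstract, so the function names and the formula are illustrative assumptions.

```python
import math


def success_probability(per_step_accuracy: float, depth: int) -> float:
    """Assumed model: every step succeeds independently with the same probability."""
    return per_step_accuracy ** depth


def half_success_horizon(per_step_accuracy: float) -> float:
    """Depth N at which the assumed model predicts a 50% success probability."""
    return math.log(0.5) / math.log(per_step_accuracy)


def per_step_from_horizon(horizon: float) -> float:
    """Invert the horizon: the per-step accuracy implied by a 50%-success horizon."""
    return 0.5 ** (1.0 / horizon)


def per_step_from_endpoint(success: float, depth: int) -> float:
    """Per-step accuracy implied by an observed success probability at one depth."""
    return success ** (1.0 / depth)


# Figures taken from the abstract, interpreted under the assumed model.
p_relational = per_step_from_horizon(4.7)         # roughly 0.863 per step
p_tracking = per_step_from_endpoint(0.92, 50)     # roughly 0.9983 per step
print(p_relational, p_tracking, half_success_horizon(p_tracking))
```

Under these assumptions, a 50%-success horizon of 4.7 steps corresponds to a per-step accuracy of about 0.863, whereas a success probability of 0.92 at N=50 corresponds to about 0.9983 per step and a horizon of over 400 steps. Either the per-step accuracy or the horizon could serve as a single summary number for a task family.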