The Complexity Ceiling Benchmark (CCB) evaluates how language model reasoning decays as the number of required sequential steps increases, holding semantic content fixed while varying task depth N from 5 to 50. The study reveals consistent geometric per-step decay across three distinct regimes: grounded spatial state-tracking, abstract symbolic pointer manipulation, and transitive relational inference.

  • Across 6,000 trials on five frontier and open-weight LLMs, the strongest models retained a success probability greater than 0.92 at N=50 for the first two regimes.
  • In transitive relational inference, every model collapsed by N=5, with the best model's 50%-success horizon limited to approximately 4.7 steps.
  • A trace-level metric (TFBC) indicates that 14.5% of correct answers were reached via incorrect intermediate reasoning.
  • Forced verbose state-tracking did not improve performance (McNemar p=1.000), and the mean step at which reasoning diverges predicts accuracy better than parameter count.
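
As an illustration of how a McNemar p-value of 1.000 can arise in a paired comparison, the sketch below computes an exact two-sided McNemar test from the discordant pair counts. It assumes the exact binomial variant of the test (the text above does not state which variant was used), and the function name `mcnemar_exact` and the counts passed to it are hypothetical, not data from the study.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b: trials solved only under the baseline prompt (hypothetical count)
    c: trials solved only under forced verbose state-tracking (hypothetical count)
    Concordant pairs (both right or both wrong) do not enter the test.
    """
    n = b + c
    if n == 0:
        # No discordant pairs: the two conditions are indistinguishable.
        return 1.0
    k = min(b, c)
    # Lower binomial tail under H0: each discordant pair is a fair coin flip.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    # Double the tail for a two-sided test, capped at 1.
    return min(1.0, 2 * tail)


# Equal discordant counts give no evidence that either condition is better.
p_balanced = mcnemar_exact(3, 3)
# A lopsided split gives a small p-value.
p_lopsided = mcnemar_exact(1, 9)
```

Under this variant, the p-value reaches 1.000 whenever the two conditions win equally many discordant trials, which is what a null effect of verbose state-tracking would look like.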

CCB and its geometric decay model reduce a model's long-horizon reasoning profile to one interpretable number per task family, providing a standardized method for assessing reasoning limits.
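
To make the "one interpretable number" concrete, the following is a minimal sketch that assumes the simplest geometric form, P(N) = p^N, where p is a constant per-step success rate. The exact parameterization used by CCB is not given here, so this form and the helper names `per_step_rate` and `half_horizon` are assumptions for illustration only.

```python
import math


def per_step_rate(success_prob: float, n_steps: int) -> float:
    """Per-step success rate p implied by P(N) = p**N (assumed model)."""
    return success_prob ** (1.0 / n_steps)


def half_horizon(p: float) -> float:
    """Depth N at which p**N falls to 0.5, i.e. the 50%-success horizon."""
    return math.log(0.5) / math.log(p)


# A model that keeps success probability 0.92 at N=50 (the reported lower bound).
p_strong = per_step_rate(0.92, 50)
h_strong = half_horizon(p_strong)

# A model whose 50%-success horizon is about 4.7 steps.
p_weak = 0.5 ** (1 / 4.7)
h_weak = half_horizon(p_weak)

print(f"strong regime: p = {p_strong:.4f}, horizon = {h_strong:.0f} steps")
print(f"weak regime:   p = {p_weak:.4f}, horizon = {h_weak:.1f} steps")
```

Under this assumed form, a success probability of 0.92 at N=50 corresponds to a per-step rate of roughly 0.998 and a horizon of several hundred steps, while a 4.7-step horizon corresponds to a per-step rate of roughly 0.86. Small differences in per-step reliability therefore translate into very large differences in usable reasoning depth.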