The Generalization Spectrum: A Chromatographic Approach to Evaluating Learning Algorithms

Traditional evaluations reduce learning to a single aggregate score, obscuring how well knowledge acquired from one example generalizes to others. The authors introduce the Generalization Spectrum, an evaluation framework that measures per-sample generalization by tracking performance across test variants of increasing transfer distance. These variants range from exact recall, through implementation transfer across languages, to context transfer under narrative reframing. The framework is instantiated on competitive programming using a selection-and-synthesis pipeline seeded with recent problems to mitigate contamination. A comparison of canonical learning paradigms shows that Reinforcement Learning converts memorization into near transfer more efficiently than Supervised Fine-Tuning baselines, while in-context learning exhibits strong but correspondence-dependent transfer. Diagnostic profiles reveal that local gains do not necessarily expand a method's generalization radius. Specifically, abstractions and hints mainly lift local transfer, whereas Reference SFT preserves a stronger far-transfer tail than RFT. Furthermore, self-distillation or hint-assisted RL can reduce far transfer even when local transfer improves.
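
To make the per-sample view concrete, the following is a minimal sketch of how a spectrum could be tabulated. It is an illustration under stated assumptions, not the authors' implementation: the level names, their ordering, the `SampleResult` record, and the `spectrum` and `radius` helpers are all hypothetical, and pass/fail per variant is assumed to be the only signal recorded.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical variant levels, ordered by increasing transfer distance.
# The labels echo the abstract; the exact set and order are assumptions.
LEVELS = ["exact_recall", "implementation_transfer", "context_transfer"]


@dataclass
class SampleResult:
    """Outcome for one learned example across its test variants (hypothetical)."""
    sample_id: str
    solved: Dict[str, bool]  # level name -> was that variant solved?


def spectrum(results: List[SampleResult]) -> Dict[str, float]:
    """Pass rate at each transfer distance, averaged over samples."""
    if not results:
        raise ValueError("need at least one sample")
    n = len(results)
    return {
        lvl: sum(r.solved.get(lvl, False) for r in results) / n
        for lvl in LEVELS
    }


def radius(result: SampleResult) -> int:
    """Count of consecutive levels solved, starting from the nearest one."""
    count = 0
    for lvl in LEVELS:
        if not result.solved.get(lvl, False):
            break
        count += 1
    return count
```

Under these assumptions, `spectrum` reports one pass rate per transfer distance instead of a single aggregate score, and `radius` summarizes how far a single example's learning reaches before the first failure. A method that raises the nearest pass rates without raising the farthest one would show local gains without an expanded radius.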