The authors propose KbSD, a framework that addresses reward sparsity in agentic search by combining dense token-level supervision with quadrant-adaptive optimization, so that models learn when to trust parametric memory and when to rely on retrieved evidence. The approach uses an information-asymmetric self-distillation process in which a hint-augmented teacher generates calibrated reasoning demonstrations for a student model, without requiring a larger external model.
- KbSD employs dense token-level supervision alongside outcome-level sparse rewards to guide the reasoning process across different knowledge states.
- The framework constructs a hint-augmented teacher that receives explicit signals on parametric certainty, retrieval quality, and ground-truth answers.
- A quadrant-adaptive distillation objective applies reverse KL for concentrated integration, forward KL for diverse refusal, and Pareto-optimal bidirectional KL for asymmetric quadrants.
- Experiments demonstrate consistent improvements in task accuracy and hallucination mitigation over strong baselines, particularly in challenging quadrants where sparse rewards are least informative.
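
To make the quadrant-adaptive objective above more concrete, here is a minimal sketch of how the KL direction might be chosen per quadrant. It is an illustration under stated assumptions, not the authors' implementation. The function names (`kl`, `quadrant_loss`), the boolean flags `knows_answer` and `retrieval_good`, and the mapping of quadrants to KL directions are all hypothetical. In particular, the fixed `alpha` mix is a simple stand-in for the Pareto-optimal bidirectional weighting, whose exact form the summary does not specify.

```python
import math


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def quadrant_loss(teacher, student, knows_answer, retrieval_good, alpha=0.5):
    """Pick a KL direction from the knowledge state (hypothetical mapping).

    teacher, student: next-token distributions at one position.
    knows_answer:     assumed flag for parametric certainty.
    retrieval_good:   assumed flag for retrieval quality.
    alpha:            assumed fixed mixing weight, standing in for the
                      Pareto-optimal weighting described in the summary.
    """
    forward = kl(teacher, student)  # mass-covering: student spreads over teacher modes
    reverse = kl(student, teacher)  # mode-seeking: student concentrates on one mode

    if knows_answer and retrieval_good:
        # Assumed "integration" quadrant: concentrate on the teacher's answer.
        return reverse
    if not knows_answer and not retrieval_good:
        # Assumed "refusal" quadrant: keep diverse ways of declining.
        return forward
    # Asymmetric quadrants: blend both directions.
    return alpha * forward + (1 - alpha) * reverse


if __name__ == "__main__":
    teacher = [0.7, 0.2, 0.1]
    student = [0.5, 0.3, 0.2]
    for knows in (True, False):
        for good in (True, False):
            loss = quadrant_loss(teacher, student, knows, good)
            print(f"knows={knows} retrieval_good={good} loss={loss:.4f}")
```

Reverse KL penalizes the student for placing mass where the teacher has little, which favors a single concentrated answer. Forward KL penalizes the student for missing any region the teacher covers, which favors diversity. That difference is the usual motivation for choosing one direction over the other.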
In practice, the method lets large language models make better-calibrated decisions during dynamic retrieval, reducing hallucinations and improving performance in complex agentic search scenarios.