Know Before You Fetch: Calibrated Retrieval-Budget Allocation for Retrieval-Augmented Generation

This article introduces an adaptive RAG framework that allocates retrieval budgets by calibrating sequence log-probability and prefix-logit uncertainty signals into probabilities of correctness. The system decides whether to answer closed-book, retrieve a compact context (k=1), retrieve a full context (k=5), or abstain based on these calibrated probabilities.
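As a rough illustration of the four-way decision described above, the sketch below maps a single calibrated probability of closed-book correctness to one of the four actions. The threshold bands, their default values, and the names `Action` and `choose_action` are assumptions made for this example; the article's actual decision rule and operating points may differ.

```python
from enum import Enum


class Action(Enum):
    CLOSED_BOOK = "closed_book"   # answer without retrieval
    RETRIEVE_K1 = "retrieve_k1"   # compact context, k=1
    RETRIEVE_K5 = "retrieve_k5"   # full context, k=5
    ABSTAIN = "abstain"


def choose_action(p_correct, t_answer=0.8, t_compact=0.5, t_full=0.2):
    """Map a calibrated probability of correctness to a retrieval action.

    The three thresholds are hypothetical placeholders, not values
    reported in the article. Higher confidence gets a smaller budget.
    """
    if not 0.0 <= p_correct <= 1.0:
        raise ValueError("p_correct must be a probability in [0, 1]")
    if not t_answer >= t_compact >= t_full:
        raise ValueError("thresholds must be ordered: t_answer >= t_compact >= t_full")

    if p_correct >= t_answer:
        return Action.CLOSED_BOOK
    if p_correct >= t_compact:
        return Action.RETRIEVE_K1
    if p_correct >= t_full:
        return Action.RETRIEVE_K5
    return Action.ABSTAIN


if __name__ == "__main__":
    for p in (0.95, 0.6, 0.3, 0.05):
        print(p, choose_action(p).value)
```

In a scheme like this, the thresholds would be tuned on held-out data per task, which is why the quality of the calibrated probability matters more than the raw uncertainty signal.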

Diagnostic out-of-fold calibration substantially improves probability quality, reducing expected calibration error (ECE) from 0.275 to 0.062 on TriviaQA and from 0.643 to 0.009 on Natural Questions.
Graded retrieval improves full-context and passage-budget frontiers for both the proposed signal and TARG-style prefix entropy/margin.
Held-out threshold experiments identify deployable operating points for different QA tasks including TriviaQA, Natural Questions, and MS MARCO.
A measured cost model shows that gating is not universally faster: at matched-accuracy frontiers it increases latency by about 27% on Qwen3-8B but reduces it by about 8% on Qwen3-32B.
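For readers unfamiliar with the calibration metric cited in the findings above, the following is a minimal sketch of the common equal-width binned form of expected calibration error. The bin count of 10 and the function name are assumptions for illustration; the article's exact binning scheme is not stated here.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - mean confidence| per bin.

    probs  : predicted probabilities of correctness, each in [0, 1]
    labels : 1 if the answer was correct, 0 otherwise
    n_bins : number of equal-width bins (assumed; not from the article)
    """
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    n = len(probs)
    if n == 0:
        raise ValueError("need at least one example")

    # Per bin: [count, sum of confidences, sum of correct labels]
    bins = [[0, 0.0, 0.0] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx][0] += 1
        bins[idx][1] += p
        bins[idx][2] += y

    ece = 0.0
    for count, conf_sum, correct_sum in bins:
        if count:
            gap = abs(correct_sum / count - conf_sum / count)
            ece += (count / n) * gap
    return ece


if __name__ == "__main__":
    # Four predictions at 0.9 confidence, three of them correct.
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0]))
```

A value near zero means that, within each confidence bin, the stated probability matches the observed accuracy, which is what allows a fixed threshold on that probability to be interpreted as a target accuracy.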

The authors argue that this matters because calibrated confidence serves as a reusable interface for allocating retrieval budget under specific task and system constraints, and because the measured results show that the efficiency gains of adaptive RAG depend on the model and task rather than holding universally.