A pilot benchmark on evidence depth for LLM calls argues that calibration must extend beyond factual correctness to include epistemic contamination and framing leakage. The study defines 'k*' as the evidence-saturation point where reliability is maximized, distinguishing it from standard retriever top-k or state-density metrics.

  • Correctness-only calibration can miss degradation entirely: in a dual-instrumented sweep, factual correctness stayed flat at 1.000 for every k ≥ 1 while contamination signals reached 0.05–0.08.
  • The reliability-optimal k* varies across five task types: factual recall, multi-hop, state tracking, conflict resolution, and constraint following.
  • The study discourages fixed defaults such as top-3, top-5, or filling the context window, and recommends measuring k* per model, task type, context format, and reliability axis.
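
As a rough illustration of the points above, the sketch below picks k* from a sweep that records both correctness and contamination at each evidence depth. The summary does not say how the study combines the two axes, so the `reliability` score (the worse of correctness and 1 − contamination), the `tol` tie tolerance, and all names here are assumptions. The sweep values are invented for illustration, not taken from the benchmark.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class SweepPoint:
    """One measurement at evidence depth k (hypothetical record layout)."""
    k: int
    correctness: float    # fraction of factually correct answers
    contamination: float  # contamination / framing-leakage signal, 0 = clean


def reliability(p: SweepPoint) -> float:
    # Assumed combination rule: a point is only as reliable as its worst axis.
    return min(p.correctness, 1.0 - p.contamination)


def tied_ks(points: Iterable[SweepPoint],
            score: Callable[[SweepPoint], float],
            tol: float = 1e-9) -> List[int]:
    """Return every k whose score is within tol of the best score, ascending."""
    pts = list(points)
    if not pts:
        raise ValueError("sweep is empty")
    best = max(score(p) for p in pts)
    return sorted(p.k for p in pts if score(p) >= best - tol)


def estimate_k_star(points: Iterable[SweepPoint], tol: float = 1e-9) -> int:
    """Smallest k that reaches the best combined reliability (cheapest tie)."""
    return tied_ks(points, reliability, tol)[0]


# Illustrative, made-up sweep for one (model, task type, context format).
sweep = [
    SweepPoint(k=0, correctness=0.60, contamination=0.00),
    SweepPoint(k=1, correctness=1.00, contamination=0.00),
    SweepPoint(k=2, correctness=1.00, contamination=0.00),
    SweepPoint(k=3, correctness=1.00, contamination=0.05),
    SweepPoint(k=5, correctness=1.00, contamination=0.08),
]

# Correctness alone cannot separate any k >= 1.
correctness_ties = tied_ks(sweep, lambda p: p.correctness)
# The second axis narrows the candidates and yields a single k*.
k_star = estimate_k_star(sweep)
```

In this invented sweep, correctness ties across every k ≥ 1, while the combined score rules out the deeper settings. In practice one such sweep would be run per model, task type, context format, and reliability axis rather than reused across them.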

This approach helps RAG systems, long-memory agents, and model routers treat evidence depth as a measured deployment parameter rather than a guess, improving auditability and cost control.