CALIBER is a method that elicits and supervises a model's confidence estimates at two stages: before and after reasoning. For a 7B model on BigMathDigits, it reduces Expected Calibration Error (ECE) by 52.5% and achieves the best Brier score and AUROC. It also performs best on out-of-distribution benchmarks such as GPQA and TriviaQA.
arXiv cs.CL
CALIBER: Calibrating Confidence Before and After Reasoning in Language Models
Benchmarks
| Benchmark | Model | Score |
|---|---|---|
| GPQA Diamond | 30B model | 0% |
| BigMathDigits | 7B model | 52.5% (ECE reduction) |
| BIG-Bench Hard | 30B model | 0% |
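
For readers unfamiliar with the calibration metrics named in the summary, the following is a minimal sketch of how Expected Calibration Error and the Brier score are commonly computed from per-question confidences and correctness labels. It is not taken from the CALIBER paper: the equal-width 10-bin ECE, the function names, and the toy inputs are all assumptions made for illustration, and `relative_reduction` assumes that a "reduction" in ECE is measured relative to a baseline value.

```python
from typing import List, Sequence, Tuple


def brier_score(confidences: Sequence[float], correct: Sequence[int]) -> float:
    """Mean squared gap between stated confidence and the 0/1 outcome."""
    return sum((c - y) ** 2 for c, y in zip(confidences, correct)) / len(confidences)


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[int], n_bins: int = 10
) -> float:
    """Equal-width binned ECE (an assumed, commonly used variant).

    Each prediction is placed in a confidence bin; ECE is the
    size-weighted average of |accuracy - mean confidence| over bins.
    """
    n = len(confidences)
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, y))

    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


def relative_reduction(baseline: float, improved: float) -> float:
    """Fractional drop from a baseline metric value (hypothetical helper)."""
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # Toy data, not results from the paper.
    conf = [0.9, 0.9, 0.6, 0.6]
    hit = [1, 0, 1, 0]
    print("Brier:", brier_score(conf, hit))
    print("ECE:", expected_calibration_error(conf, hit))
    print("Relative reduction:", relative_reduction(0.20, 0.10))
```

Lower is better for both metrics. AUROC, by contrast, measures how well confidence ranks correct answers above incorrect ones, so a method can improve it without changing calibration.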