arXiv cs.CL · 1d ago · research

CALIBER: Calibrating Confidence Before and After Reasoning in Language Models


CALIBER elicits and supervises a language model's confidence estimates at two stages: before and after reasoning. On BigMathDigits with a 7B model, it reduces Expected Calibration Error (ECE) by 52.5% and achieves the best Brier score and AUROC. It also performs best on out-of-distribution benchmarks such as GPQA and TriviaQA.
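For readers unfamiliar with the calibration metrics named above, here is a minimal sketch of how ECE and the Brier score are commonly computed from per-answer confidences and correctness labels. It uses the standard textbook definitions, not the paper's evaluation code; the function names, the equal-width binning, and the choice of 10 bins are assumptions made for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard equal-width-bin ECE (assumed binning; the paper's may differ)."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, y))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(y for _, y in b) / len(b)
        # Weight each bin's confidence/accuracy gap by its share of samples.
        ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece


def brier_score(confidences, correct):
    """Mean squared error between confidence and the 0/1 correctness label."""
    return sum((c - y) ** 2 for c, y in zip(confidences, correct)) / len(confidences)


def relative_reduction(before, after):
    """Fractional drop of a metric, e.g. 0.525 for a 52.5% reduction."""
    return (before - after) / before


# Hypothetical toy data: stated confidences and whether each answer was right.
conf = [0.9, 0.9, 0.1, 0.1]
hits = [1, 1, 0, 0]
print(expected_calibration_error(conf, hits), brier_score(conf, hits))
```

Lower is better for both metrics. A "52.5% reduction" in this sense means the ECE after training is 0.475 times the baseline ECE.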

Importance 2/3 · arXiv cs.CL · Evaluation & benchmarks · Reasoning models

Benchmarks

Benchmark	Model	Score
GPQA Diamond	30B model	0%
BigMathDigits	7B model	52.5% (relative ECE reduction)
BIG-Bench Hard	30B model	0%
