Benchmark · reasoning
BIG-Bench Hard
saturated · 2 results · 2 models (7B model, 30B model)
Timeline
- 2026-06-24 · 7B model · 52.5% · CALIBER: Calibrating Confidence Before and After Reasoning in Language Models
- 2026-06-24 · 30B model · 0.0% · CALIBER: Calibrating Confidence Before and After Reasoning in Language Models