Benchmark · reasoning

BIG-Bench Hard

Status: saturated · 2 results · 2 models
[Chart: scores on 2026-06-24 · 7B model: 52.5 · 30B model: 0]
Timeline
  1. 2026-06-24 · 7B model · 52.5% · CALIBER: Calibrating Confidence Before and After Reasoning in Language Models
  2. 2026-06-24 · 30B model · 0.0% · CALIBER: Calibrating Confidence Before and After Reasoning in Language Models