Benchmark · reasoning

GPQA Diamond

1 result · 1 model
30B model · 0 · 2026-06-24
Timeline
  1. 2026-06-24 · 30B model · 0.0% · CALIBER: Calibrating Confidence Before and After Reasoning in Language Models