ObviousBench is a new benchmark designed to evaluate visible failures in large language models, focusing on how configuration choices affect error rates. It highlights the trade-offs between model size, speed, and reasoning capability rather than simply ranking models by performance.
- GPT-5.4 nano's answer pass rate rises from 36.8% with no reasoning to 91.7% at the high reasoning setting.
- The benchmark measures visible failure risk across smaller, cheaper, faster, or lower-reasoning model configurations.
- Source code is available on GitHub in the repository adamallcock/obviousbench.
This approach helps users understand how specific model configurations affect reliability and error visibility in practical applications.
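To make the GPT-5.4 nano figures concrete, the short sketch below converts the two quoted pass rates into failure rates and compares them. It assumes that every answer that does not pass counts as a failure (failure rate = 100% minus pass rate), which may not match how ObviousBench itself defines visible failure. The function and variable names are illustrative and are not taken from the repository.

```python
def failure_rate(pass_rate_pct: float) -> float:
    """Convert a pass rate in percent to a failure rate in percent.

    Assumption: every non-passing answer counts as a failure.
    """
    return round(100.0 - pass_rate_pct, 1)


# Pass rates quoted in the summary for GPT-5.4 nano.
pass_rates = {"no reasoning": 36.8, "high reasoning": 91.7}

# Failure rate for each reasoning setting.
failure_rates = {setting: failure_rate(p) for setting, p in pass_rates.items()}

# How many times more often the no-reasoning setting fails.
reduction = failure_rates["no reasoning"] / failure_rates["high reasoning"]

for setting, rate in failure_rates.items():
    print(f"{setting}: {rate}% failures")
print(f"failures drop by a factor of about {reduction:.1f}")
```

Under that assumption, roughly 63 of every 100 answers fail with no reasoning and about 8 of every 100 fail at the high setting. Framing the same numbers as failure counts rather than pass rates is closer to what a user of the model would actually notice.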