ObviousBench: A Benchmark for Visible LLM Failures in Smaller Models

ObviousBench is a new benchmark for evaluating visible failures in large language models, focusing on how configuration choices affect error rates. Rather than only ranking models by performance, it highlights the trade-offs among model size, speed, and reasoning capability.

GPT-5.4 nano's answer pass rate rises from 36.8% with no reasoning to 91.7% at high reasoning settings.
The benchmark measures visible failure risk across smaller, cheaper, faster, or lower-reasoning model configurations.
The source code is available on GitHub in the adamallcock/obviousbench repository.

This approach helps users understand how specific model configurations affect reliability and error visibility in practical applications.
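
To put the GPT-5.4 nano figures in perspective, the short Python sketch below works out the gap between the two reported pass rates. It is illustrative only: the function name compare_pass_rates is hypothetical and not part of ObviousBench, and treating the failure rate as 100 minus the pass rate is an assumption made here, since the benchmark may define visible failures differently.

```python
def compare_pass_rates(baseline_pct: float, improved_pct: float) -> dict:
    """Summarize the change between two pass rates given in percent.

    Hypothetical helper for illustration; not part of ObviousBench.
    """
    if not (0 < baseline_pct <= 100 and 0 <= improved_pct <= 100):
        raise ValueError("pass rates must be percentages, with a nonzero baseline")
    return {
        # Absolute change, in percentage points.
        "gain_points": round(improved_pct - baseline_pct, 1),
        # How many times higher the improved pass rate is.
        "relative_gain": round(improved_pct / baseline_pct, 2),
        # Assumption: every answer that does not pass counts as a failure.
        "baseline_fail_pct": round(100 - baseline_pct, 1),
        "improved_fail_pct": round(100 - improved_pct, 1),
    }


# Pass rates reported for GPT-5.4 nano: no reasoning vs. high reasoning.
summary = compare_pass_rates(36.8, 91.7)
print(summary)
```

With the reported numbers this gives a gain of 54.9 percentage points, or a pass rate about 2.49 times higher. Under the stated assumption, the implied failure rate falls from 63.2% to 8.3%.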