Current coding benchmarks were designed before agentic software engineering emerged, and they fail to capture the complexity of real-world systems. They conflate the model's performance with that of the entire harness, ignore valid alternative solutions, and provide no feedback signal at the level of individual components, which makes iterative improvement difficult.