A measurement study finds that none of 26 semantic post-hoc operators improves held-out accuracy over Best-of-N (BoN) sampling in frozen small code models. Two operators, expression-layer recovery and adaptive consensus early-stop, offer benefits in compute efficiency or program recovery, but neither outperforms BoN on accuracy. The results point to systemic limitations in error detection and error coverage, suggesting that model harnesses and coverage must improve before post-hoc reasoning is worth considering.
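To make the terms concrete, here is a minimal sketch of a Best-of-N vote with a consensus-based early stop. It is not the paper's implementation: the function name `best_of_n_with_early_stop`, the `sample` callback, and the `min_samples` and `consensus` thresholds are all hypothetical. It assumes each candidate program can be reduced to a hashable "answer" (for example, a tuple of its outputs on a few probe inputs), and that sampling stops once one answer holds a large enough share of the votes.

```python
from collections import Counter
from typing import Callable, Hashable, Optional, Tuple


def best_of_n_with_early_stop(
    sample: Callable[[int], Hashable],  # hypothetical: returns the i-th candidate's answer
    n: int,
    min_samples: int = 3,      # assumed floor before early stopping is allowed
    consensus: float = 0.6,    # assumed vote share that counts as agreement
) -> Tuple[Optional[Hashable], int]:
    """Draw up to n candidates and return (majority answer, samples drawn).

    Stops early once the leading answer's vote share reaches `consensus`
    after at least `min_samples` draws.
    """
    votes: Counter = Counter()
    drawn = 0
    for i in range(n):
        votes[sample(i)] += 1
        drawn += 1
        top_answer, top_count = votes.most_common(1)[0]
        if drawn >= min_samples and top_count / drawn >= consensus:
            return top_answer, drawn  # consensus reached, save the remaining budget
    if not votes:
        return None, 0  # n == 0: nothing was sampled
    # Budget exhausted: fall back to the plain majority vote over all n draws.
    return votes.most_common(1)[0][0], drawn


if __name__ == "__main__":
    answers = ["a", "a", "b", "a"]  # toy stand-in for sampled program outputs
    print(best_of_n_with_early_stop(lambda i: answers[i], n=len(answers)))
```

In this toy run the vote stops after three of the four samples, which is the kind of compute saving the summary attributes to early stopping. The selected answer is the same one a full majority vote would pick, so the sketch saves samples without changing accuracy.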
arXiv cs.CL · 9d ago · research
Post-Hoc Operators Fail to Improve Accuracy in Small Code Models
Importance 2/3
New harness with differentiators
DeepSeek
Code generation
Evaluation & benchmarks
Research paper
Benchmarks
| Benchmark | Model | Score |
|---|---|---|
| HumanEval+ | DeepSeek-Coder-1.3B | 12 tasks |
| MBPP+ | DeepSeek-Coder-1.3B | — |