A measurement study finds that none of 26 semantic post-hoc operators improves held-out accuracy over Best-of-N (BoN) in frozen small code models. Some operators reduce compute usage or recover correct programs, but none beats BoN on accuracy, which the study attributes to systemic limitations such as coverage walls and consensus traps. An expression-layer recovery step (M1) recovers 12 additional tasks on HumanEval+ with no harm or leakage, and the result is consistent across model cells.
arXiv cs.LG · 9d ago · research
Post-Hoc Falsification Operators Fail to Improve Accuracy in Small Code Models
Importance 2/3
New harness with differentiators
DeepSeek
Code generation
Evaluation & benchmarks
Training methods
Benchmarks
| Benchmark | Model | Score |
|---|---|---|
| HumanEval+ | DeepSeek-Coder-1.3B | +12 tasks (M1 recovery) |
| HumanEval | DeepSeek-Coder-1.3B | — |
| MBPP | DeepSeek-Coder-1.3B | — |
| MBPP+ | DeepSeek-Coder-1.3B | — |