arXiv cs.LG · 9d ago · research

Post-Hoc Falsification Operators Fail to Improve Accuracy in Small Code Models

A measurement study finds that none of 26 semantic post-hoc operators improves held-out accuracy over Best-of-N (BoN) sampling in frozen small code models. Some operators reduce compute usage or recover correct programs, but none beats BoN on accuracy; the study attributes this to systemic limitations such as coverage walls and consensus traps. An expression-layer recovery method (M1) gains 12 tasks on HumanEval+, with no harm or leakage, and gives consistent results across model cells.
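For readers unfamiliar with the baseline, here is a minimal sketch of Best-of-N selection. It is an illustration only, not the paper's harness. It assumes candidates are ranked by how many visible tests they pass, and it reads "coverage wall" as the case where no sampled program is correct, so no selector can help; both are assumptions, since the summary defines neither. Every name in it (`score`, `best_of_n`, `held_out_correct`, the toy pools) is hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

# A "program" here is just a Python callable standing in for a sampled solution.
Candidate = Callable[[int], int]
Tests = Sequence[Tuple[int, int]]  # (input, expected output) pairs


def score(candidate: Candidate, tests: Tests) -> int:
    """Count how many tests the candidate passes; crashes count as failures."""
    passed = 0
    for arg, expected in tests:
        try:
            if candidate(arg) == expected:
                passed += 1
        except Exception:
            pass
    return passed


def best_of_n(candidates: List[Candidate], visible: Tests) -> Candidate:
    """Assumed BoN rule: pick the sample passing the most visible tests."""
    return max(candidates, key=lambda c: score(c, visible))


def held_out_correct(candidate: Candidate, held_out: Tests) -> bool:
    """A candidate counts as correct only if it passes every held-out test."""
    return score(candidate, held_out) == len(held_out)


# Toy task: square an integer.
def double(x: int) -> int:
    return x * 2


def square(x: int) -> int:
    return x * x


def add_two(x: int) -> int:
    return x + 2


visible = [(2, 4), (3, 9)]
held_out = [(5, 25), (0, 0)]

# Pool that contains a correct sample: selection can find it.
pool = [double, square, add_two]

# Pool with no correct sample (assumed reading of a "coverage wall"):
# whatever the selector picks, held-out accuracy cannot improve.
walled_pool = [double, add_two]
```

In this toy setup `best_of_n(pool, visible)` returns `square`, which passes the held-out tests, while no choice from `walled_pool` can. Any post-hoc operator that only re-ranks or filters the same samples faces the same ceiling.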

Importance 2/3 · New harness with differentiators · arXiv cs.LG · DeepSeek · Code generation · Evaluation & benchmarks · Training methods

Benchmarks

Benchmark	Model	Score
HumanEval+	DeepSeek-Coder-1.3B	+12 tasks
HumanEval	DeepSeek-Coder-1.3B	—
MBPP	DeepSeek-Coder-1.3B	—
MBPP+	DeepSeek-Coder-1.3B	—
