Code generation
arxiv arXiv cs.LG · 6d ago

Probe-and-Refine Tuning Improves Coding Agent Performance

A new method called probe-and-refine tuning uses synthetic bug-fix probes to iteratively improve repository guidance files with single-shot LLM calls, without agent loops or tool use. On SWE-bench Verified, it achieves a 33.0% mean resolve rate—14.5 percentage points higher than the initial static knowledge base—showing improved coverage rather than patch precision. The method enables agents to use larger step budgets effectively, and performance remains stable across models when diagnostic output is sufficient.

arxiv arXiv cs.AI · 6d ago

IHUBERT: Persian Pretrained Model with Semantic Deduplication

IHUBERT is a monolingual Persian pretrained language model trained on a 45 GB curated subset of the Sepahr-Danesh collection. It uses vector-based semantic deduplication and a domain-balanced pretraining pipeline to improve corpus quality and reduce redundancy, achieving top performance in extractive question answering and strong results in NER and topic classification, though relation extraction remains a challenge.

arxiv arXiv cs.CL · 6d ago

IHUBERT: Persian Pretrained Model with Semantic Deduplication

IHUBERT is a monolingual Persian pretrained language model trained on a 45 GB curated subset of the Sepahr-Danesh collection. It uses vector-based semantic deduplication and a domain-balanced pretraining pipeline to improve corpus quality and reduce redundancy, achieving top performance in extractive question answering and strong results in NER and topic classification, though relation extraction remains a challenge.

arxiv arXiv cs.CL · 6d ago

PsyScore: A Psychometrically-Aware Framework for Trait-Adaptive Essay Scoring and ZPD-Scaffolded Feedback

PsyScore integrates diagnostic scoring and instructional feedback using a shared latent ability model. It features a trait-adaptive neural IRT scorer based on GPCM, a ZPD-scaffolded feedback generator that tailors instruction by proficiency level, and a multi-perspective evaluation strategy. Experiments on ASAP++ show competitive scoring and more pedagogically aligned feedback compared to existing methods.