Results
Sort
Reset
arxiv arXiv cs.LG · 7d ago

Probe-and-Refine Tuning Improves Coding Agent Performance

A new method called probe-and-refine tuning uses synthetic bug-fix probes to iteratively improve repository guidance files with single-shot LLM calls, without agent loops or tool use. On SWE-bench Verified, it achieves a 33.0% mean resolve rate—14.5 percentage points higher than the initial static knowledge base—showing improved coverage rather than patch precision. The method enables agents to use larger step budgets effectively, and performance remains stable across models when diagnostic output is sufficient.

arxiv arXiv cs.LG · 7d ago

Execution-State Capsules for Low-Latency On-Device AI Serving

Execution-state capsules enable graph-bound checkpointing and restoration of complete execution state, including KV, recurrent, and convolution states, for low-latency, small-batch on-device AI serving. On RTX 5090 and Jetson AGX Thor, capsule restore achieves byte-exact and token-identical correctness, with sub-millisecond GPU operations and TTFT speedups up to 27x at 16k tokens, demonstrating significant latency reduction in interactive AI workflows.

arxiv arXiv cs.AI · 7d ago

Lean as Process-Verified Reward Oracle in RL for Theorem Proving

This work shows that Lean can serve as a symbolic process oracle, providing fine-grained, verified feedback during reinforcement learning. By parsing proof attempts into tactic sequences and using Lean's elaboration to mark sound steps and first failures, the system generates dense, type-theoretic reward signals. Experiments demonstrate tactic-level supervision outperforms outcome-only methods on benchmarks like MiniF2F and ProofNet, highlighting Lean's role as both evaluator and training reward source.

arxiv arXiv cs.AI · 7d ago

Dual-Agent Framework for Cross-Model Verified Translation

A dual-agent framework converts natural-language experiment protocols into executable commands for robotic lab platforms. It uses a Parser Agent and a rule-based mapping engine to translate protocols, with a heterogeneous LLM Validation Agent ensuring accuracy and triggering self-correction. The framework successfully enables end-to-end autonomous execution of microplate-based experiments like the Bradford assay.

arxiv arXiv cs.AI · 7d ago

ScaffoldAgent: Utility-Guided Dynamic Outline Optimization

ScaffoldAgent introduces a utility-guided framework for dynamic outline optimization in open-ended deep research. It models outline evolution through Expansion, Contraction, and Revision operations, guided by a feedback mechanism that evaluates retrieval gain, structural coherence, and generation quality. Experiments show it improves long-form report generation and factual grounding compared to existing agents.

arxiv arXiv cs.AI · 7d ago

MACR: Explicit Conflict Resolution for LLM Inference

MACR introduces a multi-agent reasoning framework to resolve knowledge conflicts in LLM inference by jointly assessing internal and external knowledge. It uses semantic entropy to measure confidence and employs three specialized agents to induce rules, detect conflicts, and resolve inconsistencies across contexts. Empirical results show MACR outperforms state-of-the-art methods and provides interpretable conflict resolutions.

arxiv arXiv cs.AI · 7d ago

CRAX: Fast Safe Reinforcement Learning Benchmarking

CRAX introduces a high-fidelity, accelerated safety benchmark for reinforcement learning using MuJoCo XLA. It achieves up to 100x speedups over CPU-based benchmarks via vectorization and hardware acceleration, featuring six environment suites and three agent-specific tasks across three difficulty levels. Evaluation of six safe RL methods shows no single approach dominates, highlighting trade-offs between performance and safety, with curriculum learning and safety transfer improving results.