AI agents
arXiv cs.AI · 2d ago

LLM-Agent Oversight Must Shift from Calibration to Action-Conditioned Control

Current oversight of LLM agents relies on scalar risk scores, which do not capture whether an intervention actually improves outcomes. The paper introduces "intervention advantage" as the key metric and shows that action-conditioned control outperforms scalar routing across benchmarks, with substantially lower regret in interactive regimes. Calibrating the risk scores alone does not resolve this mismatch in control performance.
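A toy sketch of the contrast, under loudly stated assumptions: "intervention advantage" is read here as the mean outcome when the overseer intervenes minus the mean outcome when it does not, computed separately per action type. The names `Episode`, `intervention_advantage`, `scalar_route`, and `action_conditioned_route`, the 0.5 threshold, and the logged data are all hypothetical, not the paper's definitions. The naive difference in means ignores confounding and is only meant to show why a high risk score and a positive intervention benefit are different signals.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Episode:
    """One logged agent step (hypothetical schema)."""
    action: str        # action type the agent proposed
    risk: float        # overseer's scalar risk score for that action
    intervened: bool   # whether the overseer stepped in
    outcome: float     # task outcome that followed (higher is better)


def intervention_advantage(episodes, action):
    """Assumed definition: mean outcome with intervention minus without, for one action type."""
    with_iv = [e.outcome for e in episodes if e.action == action and e.intervened]
    without = [e.outcome for e in episodes if e.action == action and not e.intervened]
    if not with_iv or not without:
        return 0.0  # no evidence either way for this action
    return mean(with_iv) - mean(without)


def scalar_route(risk, threshold=0.5):
    # Scalar routing: intervene whenever the risk score crosses a fixed threshold.
    return risk >= threshold


def action_conditioned_route(episodes, action):
    # Action-conditioned control: intervene only if doing so is expected to help for this action.
    return intervention_advantage(episodes, action) > 0


# Hypothetical log: both actions look risky, but intervening only helps on one of them.
log = [
    Episode("delete_file", 0.9, True, 1.0),
    Episode("delete_file", 0.9, False, 0.0),
    Episode("run_tests", 0.7, True, 0.2),
    Episode("run_tests", 0.7, False, 0.8),
]

for act, risk in (("delete_file", 0.9), ("run_tests", 0.7)):
    print(act, scalar_route(risk), action_conditioned_route(log, act))
```

Scalar routing intervenes on both actions because both risk scores exceed the threshold; the action-conditioned rule intervenes only on `delete_file`, where intervening raised the outcome, and leaves `run_tests` alone, where intervening lowered it.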

r/LocalLLaMA · 2d ago

Tmax-27B Terminal Agent for Small GPUs with DPPO Training

Tmax-27B is a terminal agent built on Qwen3.6-27B and trained with DPPO, a reinforcement-learning method; it scores 43% on Terminal Bench 2.0 and 69% on TB Lite. To run on consumer GPUs, it is quantized into importance-matrix-calibrated GGUF files ranging from 2 to 5 bits per weight, with a grafted multi-token prediction (MTP) head that enables speculative decoding. The 8.5 GiB IQ2_XS build reaches a 70% pass rate on agentic coding tasks, outperforming plain quantization while keeping tool-call generation stable.
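The file sizes follow from simple arithmetic: bytes ≈ bits per weight × parameter count / 8. The sketch below assumes a nominal 27 × 10⁹ parameters (the real count is not given in the post) and uses hypothetical helper names; it converts between bits per weight and GiB in both directions.

```python
GIB = 2 ** 30  # bytes per GiB

N_PARAMS = 27e9  # assumption: nominal 27B parameters; the exact count is not stated


def file_size_gib(bpw, n_params):
    """Approximate file size in GiB for a uniform bits-per-weight budget."""
    return bpw * n_params / 8 / GIB


def effective_bpw(size_gib, n_params):
    """Average bits stored per weight, given a file size in GiB and a parameter count."""
    return size_gib * GIB * 8 / n_params


# Size range implied by the 2-5 bits-per-weight quantizations.
for bpw in (2, 3, 4, 5):
    print(f"{bpw} bpw -> {file_size_gib(bpw, N_PARAMS):.1f} GiB")

# Average precision implied by the 8.5 GiB IQ2_XS file.
print(f"8.5 GiB -> {effective_bpw(8.5, N_PARAMS):.2f} bpw")
```

Under this assumption the 8.5 GiB IQ2_XS file averages about 2.7 bits per weight, above a flat 2-bit budget. A gap like this would be expected if some tensors are kept at higher precision or if extra weights such as the grafted MTP head are included; the post does not say which.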

arXiv cs.CL · 2d ago

Group-Graph Policy Optimization for Long-Horizon Agentic RL

Group-Graph Policy Optimization (G2PO) is a graph-based approach to long-horizon agentic reinforcement learning that merges interaction trajectories into state-transition graphs. The graph supports group-aggregated state-value estimation and edge-centric advantage calculation, which improves credit assignment and reduces variance. G2PO improves success rate by up to 22.2% over GRPO on the WebShop, ALFWorld, and AppWorld benchmarks.
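A minimal sketch of the graph idea, not the paper's formulation. It assumes states are hashable identifiers shared across a group of rollouts, each trajectory carries one undiscounted final return, a state's value is the mean return of every trajectory in the group that visited it, and an edge's advantage is the value of its successor state minus the value of its source state. All function names and the toy data are hypothetical.

```python
from statistics import mean


def build_graph(group):
    """Merge a group of (states, final_return) trajectories into a state-transition graph."""
    visits = {}   # state -> returns of trajectories that visited it
    edges = set() # (state, next_state) transitions observed anywhere in the group
    for states, ret in group:
        for s in set(states):  # count each state once per trajectory
            visits.setdefault(s, []).append(ret)
        edges.update(zip(states, states[1:]))
    return visits, edges


def state_values(visits):
    # Group-aggregated value: mean return over all trajectories that reached the state.
    return {s: mean(rets) for s, rets in visits.items()}


def edge_advantages(visits, edges):
    # Edge-centric advantage: how much a transition moves the estimated value.
    v = state_values(visits)
    return {(s, s_next): v[s_next] - v[s] for (s, s_next) in edges}


# Hypothetical group of four rollouts with sparse 0/1 success rewards.
group = [
    (["home", "search", "item_a", "buy"], 1.0),
    (["home", "search", "item_b"], 0.0),
    (["home", "search", "item_a", "buy"], 1.0),
    (["home", "item_b"], 0.0),
]

visits, edges = build_graph(group)
for edge, adv in sorted(edge_advantages(visits, edges).items()):
    print(edge, round(adv, 3))
```

In a GRPO-style scheme every step of the failed second rollout would share one trajectory-level advantage (0 minus the group mean of 0.5). Here its `home -> search` step still earns positive credit, because other rollouts succeeded through `search`, while the `search -> item_b` choice is the one penalized. This is the kind of finer credit assignment the summary attributes to G2PO.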