AI agents
arXiv cs.CL · 1d ago

MEMPROBE: Benchmark for Long-Term Memory Recovery in Agents

MEMPROBE is a benchmark that evaluates long-term memory in AI agents by reconstructing a user's hidden state from the agent's memory after interaction. It tests five memory systems across 50 simulated users, each described by 31 dimensions, and finds that task completion is high even for memoryless agents, while memory recovery remains moderate and drops under top-k retrieval. MEMPROBE enables direct, auditable assessment of memory retention and proposes recovery as a key objective for future agent development.
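To make the idea of "memory recovery" concrete, here is a minimal sketch of one way such a score could be computed. It assumes the hidden state is a flat mapping from dimension name to value and that a dimension counts as recovered only on an exact match. The function name `recovery_score`, the example dimensions, and the exact-match rule are illustrative assumptions, not the benchmark's actual protocol.

```python
def recovery_score(hidden_state: dict, reconstructed: dict) -> float:
    """Fraction of hidden-state dimensions whose value was reconstructed exactly.

    Hypothetical scoring rule for illustration only; the benchmark's real
    metric may use a different matching criterion.
    """
    if not hidden_state:
        raise ValueError("hidden_state must contain at least one dimension")
    matches = sum(
        1
        for key, value in hidden_state.items()
        if reconstructed.get(key) == value
    )
    return matches / len(hidden_state)


# Toy user with three made-up dimensions (the summary reports 31 per user).
hidden = {"diet": "vegetarian", "city": "Lisbon", "budget": "low"}
# What an auditor might reconstruct from the agent's memory alone.
guess = {"diet": "vegetarian", "city": "Porto"}

score = recovery_score(hidden, guess)  # one of three dimensions recovered
```

A score like this is independent of task success, which is the distinction the summary draws: an agent can complete tasks while retaining little of the user's state.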

arXiv cs.LG · 1d ago

Distilling Transformers into Recurrent Transformers for Efficient Memory

A new distillation method transfers the observation compression strategy of full-history transformers to recurrent models. By training a teacher model to compress observation histories into fixed-size bottlenecks, the approach aligns the student's memory with the teacher's compression. This enables recurrent transformers to achieve near-full-history performance with linear-time complexity, making them viable for long-horizon robotics applications.

arXiv cs.AI · 1d ago

DataClaw0: Agentic Tailoring of Multimodal Data from Raw Streams

DataClaw0 introduces an agentic paradigm for actively refining multimodal data to align with user and downstream intents. It uses a two-stage pipeline with factual anchors to generate a large-scale dataset across five domains and achieves strong alignment via supervised fine-tuning and GRPO. Evaluated on video generation, VQA, and GUI navigation, DataClaw0 produces high-information-density data, enabling efficient model adaptation with minimal training data.

arXiv cs.AI · 1d ago

LLM-Agent Oversight Must Shift from Calibration to Action-Conditioned Control

Current oversight of LLM agents relies on scalar risk scores, but such scores fail to capture whether an intervention actually improves outcomes. The paper introduces "intervention advantage" as the key metric and shows that action-conditioned control outperforms scalar routing across benchmarks, with significant regret reduction in interactive regimes. Calibration alone does not resolve the underlying mismatch in control performance.
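The contrast between the two oversight styles can be sketched in a few lines. This sketch assumes that "intervention advantage" means the estimated outcome with a given intervention minus the estimated outcome without it; that reading, along with every function name, threshold, and number below, is an illustrative assumption and not the paper's definition.

```python
from typing import Optional


def intervention_advantage(outcome_with: float, outcome_without: float) -> float:
    """Assumed definition: estimated gain from intervening versus not intervening."""
    return outcome_with - outcome_without


def scalar_route(risk: float, threshold: float = 0.5) -> bool:
    """Scalar routing: intervene whenever the risk score crosses a threshold."""
    return risk >= threshold


def action_conditioned_route(
    estimated_outcomes: dict, baseline: float
) -> Optional[str]:
    """Pick the intervention with the best estimated outcome, but only if it
    beats letting the agent proceed (positive advantage). Otherwise return None."""
    if not estimated_outcomes:
        return None
    best_action, best_value = max(estimated_outcomes.items(), key=lambda kv: kv[1])
    if intervention_advantage(best_value, baseline) > 0:
        return best_action
    return None


# Hypothetical step: the risk score is high, yet no available intervention helps.
risk = 0.9
baseline = 0.4  # estimated outcome if the agent proceeds unassisted
options = {"ask_user": 0.3, "block": 0.1}  # estimated outcome per intervention

scalar_decision = scalar_route(risk)
conditioned_decision = action_conditioned_route(options, baseline)
```

In this toy case the scalar router intervenes because the risk is high, while the action-conditioned router declines because every available intervention is estimated to make the outcome worse. Better calibration of `risk` would not change that disagreement.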

media r/LocalLLaMA · 1d ago

Tmax-27B Terminal Agent for Small GPUs with DPPO Training

Tmax-27B is a terminal agent based on Qwen3.6-27B and trained with DPPO (RL), achieving 43% on Terminal Bench 2.0 and 69% on TB Lite. To run on consumer GPUs, it is quantized to importance-matrix-calibrated GGUF models at 2 to 5 bits per weight, with a grafted MTP head enabling speculative decoding. IQ2_XS at 8.5 GiB achieves a 70% pass rate in agentic coding tasks, outperforming plain quantization and demonstrating stable tool-call generation.
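As a rough sanity check on the reported file size, the effective bits per weight can be estimated from the size and the parameter count. The sketch below assumes about 27 billion parameters (inferred from the model name only) and ignores file metadata, so the result is an approximation; the helper name is hypothetical.

```python
GIB = 2**30  # bytes in one gibibyte


def effective_bits_per_weight(file_size_gib: float, n_params: float) -> float:
    """Average bits stored per parameter, treating the whole file as weights."""
    return file_size_gib * GIB * 8 / n_params


# Assumption: roughly 27e9 parameters, taken from the "27B" in the model name.
bpw = effective_bits_per_weight(8.5, 27e9)
print(f"{bpw:.2f} bits per weight")  # about 2.7
```

The estimate lands near 2.7 bits per weight, inside the 2 to 5 bit range quoted above. The file-level average mixes all tensors together, including the grafted MTP head, so it need not equal the nominal rate of the quantization type.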