Build Real Agentic Apps with CUGA: 24 Working Examples
CUGA introduces a lightweight harness enabling developers to build real agentic applications. It includes 24 working examples demonstrating practical implementations across various use cases.
AgentCIBench introduces a benchmark to assess privacy risks in computer-use agents. It identifies three key failure modes—visual co-location, task-ambiguity overshare, and recipient misalignment—and finds that 11 of 15 evaluated agents leak personal data in over 50% of scenarios, with an average leakage of 67.9%.
Tmax presents the strongest open RL recipe for terminal agents, achieving 27% on Terminal-Bench 2.0 with only 9B parameters. It uses a novel data taxonomy to generate over 2.5x more terminal environments than prior datasets, enabling efficient training with a simple, outcome-only recipe. The dataset, models, and code are open-sourced at https://github.com/hamishivi/tmax.
SelfCompact enables language models to autonomously decide when and how to compact accumulated context during reasoning. By combining a model-invoked summarization tool with a lightweight rubric that guides compaction based on trajectory structure, it achieves effective adaptive compaction without fine-tuning. Results show it matches or exceeds fixed-interval methods on math and agentic search benchmarks, improving baselines by up to 18.1 points on math and 5-9 points on search, at 30-70% lower token cost.
EnterpriseClawBench is a benchmark built from real workplace sessions, featuring 852 reproducible tasks with detailed metadata. The best configuration achieves only 0.663 (Codex with GPT-5.5), highlighting the need for multi-dimensional evaluation of enterprise agents.
A Reddit post questions whether Sakana Fugu is merely an orchestration wrapper rather than a genuine AI model, and argues that its reputation as a "Mythos 5 killer" rests on misleading implications. The post raises concerns that users may misinterpret its capabilities.
OpenBioRQ introduces a benchmark of 12,553 unsolved biomedical research questions across 12 domains, designed to test agentic models' faithfulness and abstention. It evaluates models in a tool-using setting without answer keys, using real follow-up evidence rather than parametric knowledge, and reveals significant agentic collapse on the hardest questions where tools are no longer used despite being critical.
Moshi-Face is the first full-duplex spoken dialogue model that jointly processes audio and facial input, generating both speech and synchronized facial motion. It uses a VQ-VAE face codec to encode and reconstruct 3D head meshes from facial videos into discrete face tokens, and a Face Transformer module to generate these tokens non-autoregressively for real-time audiovisual output. Experiments show Moshi-Face achieves audiovisual alignment with low latency while maintaining original dialogue quality.
CFAgentBench introduces a reproducible, self-hostable environment with 1,014 machine-gradeable tasks across eight domains, grounded in real-world sources. It features 40 oracle-validated tasks with executable evaluators that assess functional correctness via state diffs and output regexes, including a money-movement guard requiring human approval for payments. A key finding is that top agents lose 43% of successes when repeating tasks under temperature-0 decoding, indicating single-attempt performance does not reflect real-world deployability.
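As a rough illustration of the evaluator style described above, the sketch below grades a task by comparing a state diff and matching the agent's output against a regex, with a guard that refuses to move money without human approval. All names here (state_diff, evaluate, move_money, ApprovalRequired) are hypothetical and are not taken from CFAgentBench itself.

```python
import re


class ApprovalRequired(Exception):
    """Raised when a payment is attempted without a human approver (hypothetical)."""


def state_diff(before, after):
    """Return {key: (old, new)} for every key whose value changed, appeared, or vanished."""
    keys = before.keys() | after.keys()
    return {k: (before.get(k), after.get(k)) for k in keys if before.get(k) != after.get(k)}


def evaluate(before, after, output, expected_diff, output_pattern):
    """Pass only if the state changed exactly as expected and the output matches the regex."""
    diff_ok = state_diff(before, after) == expected_diff
    output_ok = re.search(output_pattern, output) is not None
    return diff_ok and output_ok


def move_money(state, account, amount, approved_by=None):
    """Debit an account, but only when a human approver is named."""
    if approved_by is None:
        raise ApprovalRequired(f"payment of {amount} from {account} needs approval")
    new_state = dict(state)
    new_state[account] = new_state.get(account, 0) - amount
    return new_state


before = {"checking": 100}

# Without approval the guard blocks the payment.
try:
    move_money(before, "checking", 30)
    blocked = False
except ApprovalRequired:
    blocked = True

# With approval the payment goes through and the evaluator can grade it.
after = move_money(before, "checking", 30, approved_by="alice")
passed = evaluate(before, after, "Paid 30 USD", {"checking": (100, 70)}, r"Paid \d+ USD")
```

Grading on the diff rather than the full final state keeps the check focused on what the task was supposed to change, so unrelated state does not need to be enumerated in every evaluator.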
Nous introduces a memory architecture based on prediction rather than storage, using categorical probability distributions to model world knowledge. Evaluated on LoCoMo with GPT-4o-mini, it achieves F1 scores of 63.50 (single-hop), 55.32 (multi-hop), 58.57 (temporal), and 62.50 (open-domain), outperforming A-MEM in three categories and BeliefMem in all, though evaluation differences limit full comparability.
A new metric, coupling gain gamma, measures how agents adjust opinions when neighbors' views are perturbed. It reveals that frontier LLMs do not spontaneously polarize, and a diagnostic of final versus initial opinion shows that claimed emergent consensus in prior work involves model artifacts. Valid consensus emerges only when group-level, modality-matched coupling is considered, not single-neighbour interactions.
Omio leverages OpenAI to enhance conversational travel experiences. The company uses AI to accelerate product development and transition into an AI-native business model.
PlanBench-XL introduces a benchmark of 327 retail tasks across 1,665 tools to evaluate LLM agents' ability to iteratively retrieve and use tools in long-horizon planning. It includes a blocking mechanism simulating tool failures, revealing that agents like GPT-5.4 drop from 51.90% to 11.36% accuracy under severe disruptions, highlighting vulnerabilities in recovery and adaptability.
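One way to picture a blocking mechanism of this kind is a tool registry that refuses calls to a configurable set of tools, so the agent has to recover by retrieving an alternative. The sketch below is an assumption about the general idea, not PlanBench-XL's actual harness; ToolBlocked, make_registry, call_with_fallback, and the two toy tools are all hypothetical.

```python
class ToolBlocked(Exception):
    """Raised when a call hits a tool the harness has disabled (hypothetical)."""


def make_registry(tools, blocked):
    """Return a call(name, *args) function that fails for any tool in `blocked`."""
    def call(name, *args):
        if name in blocked:
            raise ToolBlocked(name)
        return tools[name](*args)
    return call


def call_with_fallback(call, candidates, *args):
    """Try candidate tools in order; return (tool_name, result) for the first that works."""
    for name in candidates:
        try:
            return name, call(name, *args)
        except ToolBlocked:
            continue  # simulated failure: move on to the next retrieved tool
    raise ToolBlocked("all candidate tools are blocked")


# Two toy tools that answer the same question in different ways (made-up data).
tools = {
    "price_lookup_v1": lambda sku: {"A1": 5.0}[sku],
    "catalog_search": lambda sku: {"A1": 5.0, "B2": 7.5}[sku],
}

call = make_registry(tools, blocked={"price_lookup_v1"})
used, price = call_with_fallback(call, ["price_lookup_v1", "catalog_search"], "A1")
```

Raising the fraction of blocked tools makes recovery harder, which is the kind of stress under which the summary above reports the sharp accuracy drop.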
VADAOrchestra introduces a neurosymbolic framework that combines LLM-based workflow orchestration with Datalog+/- symbolic reasoning. It enables adaptive, explainable decision-making by incrementally planning workflows and executing logical inference on demand, offering auditability, scalability, and verifiability in real-world financial scenarios.
MacAgentBench introduces a comprehensive benchmark with 676 tasks across 25 applications, 60% of which involve both GUI and CLI interactions. It uses deterministic rule-based evaluation and fine-grained multi-checkpoint scoring, revealing that Claude Opus 4.6 on OpenClaw achieves 73.7% Pass@1, primarily due to its skill library rather than framework design.
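A minimal sketch of how rule-based, multi-checkpoint scoring could relate to Pass@1, assuming each task is a list of deterministic predicates over the final machine state. The checkpoint names and state keys are invented for illustration; the benchmark's real rules and weighting are not described here.

```python
def score_task(checkpoints, final_state):
    """Return (partial_credit, passed): fraction of checkpoints met, and whether all were met."""
    results = [bool(check(final_state)) for check in checkpoints]
    return sum(results) / len(results), all(results)


def pass_at_1(outcomes):
    """Fraction of tasks whose single attempt met every checkpoint."""
    return sum(outcomes) / len(outcomes)


# Hypothetical checkpoints for one task: deterministic rules, no model-based judge.
checkpoints = [
    lambda s: s.get("file_exists") is True,
    lambda s: s.get("line_count") == 3,
    lambda s: s.get("app_closed") is True,
]

# The agent created the file correctly but left the application open.
partial, passed = score_task(
    checkpoints, {"file_exists": True, "line_count": 3, "app_closed": False}
)

# Three tasks, one attempt each: two fully passed, one did not.
overall = pass_at_1([True, False, True])
```

Reporting the partial score alongside the all-or-nothing outcome shows how far a failed attempt got, which a single Pass@1 number hides.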
Each MCP server dumps its full tool list into the model's context before any prompt, using up to 24,000 tokens for 62 tools. A local gateway implementing lazy discovery reduces tool-definition overhead by 97%, cutting token usage from ~24k to ~660 per request, with 90% fewer total tokens over a task, without affecting task success rate.
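The lazy-discovery idea can be sketched as a gateway that advertises only tool names and one-line summaries up front and returns a full schema when a specific tool is requested. This is an illustrative sketch under stated assumptions: Tool, LazyGateway, index, and describe are hypothetical names, the 62 tools are synthetic, and size is measured in JSON characters as a crude stand-in for tokens.

```python
import json
from dataclasses import dataclass


@dataclass
class Tool:
    name: str
    summary: str
    schema: dict  # full parameter definition, the expensive part


class LazyGateway:
    """Hypothetical gateway: compact index first, full schemas only on demand."""

    def __init__(self, tools):
        self._tools = {t.name: t for t in tools}

    def index(self):
        """What the model sees before any prompt: names and summaries only."""
        return [{"name": t.name, "summary": t.summary} for t in self._tools.values()]

    def describe(self, name):
        """Full definition for one tool, fetched when the model decides to use it."""
        t = self._tools[name]
        return {"name": t.name, "summary": t.summary, "schema": t.schema}


def payload_chars(obj):
    """Crude size proxy: characters of serialized JSON, not real tokens."""
    return len(json.dumps(obj))


# 62 synthetic tools with deliberately verbose schemas (made-up data).
tools = [
    Tool(
        name=f"tool_{i}",
        summary=f"Does thing {i}",
        schema={
            "type": "object",
            "properties": {
                f"arg_{j}": {"type": "string", "description": "x" * 80}
                for j in range(10)
            },
        },
    )
    for i in range(62)
]
gateway = LazyGateway(tools)

# Eager: every full definition up front. Lazy: the index plus one expanded tool.
eager = payload_chars([gateway.describe(t.name) for t in tools])
lazy = payload_chars(gateway.index()) + payload_chars(gateway.describe("tool_7"))

# Arithmetic behind the quoted figures: ~24k tokens down to ~660 per request.
quoted_reduction = 1 - 660 / 24_000  # about 0.97, matching the stated 97%
```

The trade-off is an extra round trip whenever the model needs a schema, in exchange for a context that grows with the tools actually used rather than the tools installed.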
LRE is a CPU-only, language-model-free system that learns which interaction history units are load-bearing. It outperforms baselines in accuracy-cost balance, reducing peak context size by up to 52% and improving task completion by 37% in some cases. LRE achieves superior answer quality with 68% fewer tokens and requires no annotations or neural computation for training.
Beaver is an agent harness that extracts structured information from scientific papers by integrating multimodal evidence tooling, task scaffolding, and artifact-grounded autoresearch. It achieves 81.0 on the Gold-Referenced Attribute Score, outperforming frontier agents by over 23 points, with key gains on high-value attributes requiring cross-modal reasoning.
AdaMem learns what to remember for each user from feedback, reducing memory bloat and improving QA accuracy by up to 9.0% over uniform baselines while shrinking memory volume by 9%.
Dementia-Agents introduces a clinically aligned multi-agent framework for real-world dementia staging and phenotyping. It improves diagnostic performance over monolithic models and prior systems, while maintaining domain-level interpretability, using data from 1,066 patients across two cognitive neurology services.