Cognition — korshunov.ai

Lab · Cognition

Introducing COGNITIVE ATROPHY BENCH for LLM Mental-Health Interactions

A new benchmark, COGNITIVE ATROSPHY BENCH, measures how LLMs induce cognitive decline in mental-health conversations. Built from 1,576 human-generated counseling sessions and evaluated by clinical experts, it identifies patterns like directive advice and validation that may reduce user autonomy. The tool introduces metrics such as UIRI and ARI to assess atrophy risk and track behavioral trajectories across user interactions.

arxiv arXiv cs.AI · 8d ago

Visual Verification Enables Inference-time Steering and Autonomous Policy Improvement

VERITAS introduces a generator-verifier framework that enables robots to improve policies in real time without additional training. A visual verifier evaluates actions at inference time, allowing consistent performance gains through verified rollouts that serve as effective supervision for offline policy improvement. Post-training with these verified rollouts matches expert demonstrations in efficiency, without human intervention.

arxiv arXiv cs.LG · 9d ago

Geometric Action Model for Robot Policy Learning

The Geometric Action Model (GAM) enables robot policies to reason about 3D physical interactions by repurposing a pretrained geometric foundation model. GAM splits the GFM to serve as both an observation encoder and a causal future predictor, then routes predicted future geometry and actions through the same backbone, achieving accurate, robust, and efficient manipulation performance in simulation and real-robot benchmarks.

arxiv arXiv cs.CL · 2d ago

AgentCIBench Evaluates Privacy Risks in Computer-Use Agents

AgentCIBench introduces a benchmark to assess privacy risks in computer-use agents. It identifies three key failure modes—visual co-location, task-ambiguity overshare, and recipient misalignment—and finds that 11 of 15 evaluated agents leak personal data in over 50% of scenarios, with an average leakage of 67.9%.

media AI News (smol.ai) · 4d ago

GLM-5.2 Breakout and Open-Model Progress Highlighted

Zhipu's GLM-5.2 emerged as the top open-weight model, praised for its frontier-adjacent performance in daily use, with improvements in coding tasks and reduced 1M-token inference cost via IndexShare. It outperformed other open models in agentic knowledge work benchmarks, reaching 1266 Elo in Artificial Analysis' AA-Briefcase test, though only 3% of tasks were fully satisfied by top models, indicating persistent challenges in real-world long-horizon agent performance.

arxiv arXiv cs.LG · 8d ago

Flash Endurance as Depreciating Capital in Robot Memory

A robot's flash memory degrades with each write, forming a non-renewable asset. A wear-aware pricing model uses a shadow price $η$ to guide memory placement across RAM, NVM, and cloud, with optimal routing depending on whether task value increases with memory persistence. The sign of the value-write association $χ$ varies by deployment: positive in long-horizon manipulation, null in short-horizon tasks, and negative in teleoperation. The endurance budget is binding only on low-end QLC/eMMC memory, and while wear-aware routing aligns with task value, actual performance improvements remain unverified in data.

arxiv arXiv cs.AI · 8d ago

Flash Endurance as Depreciating Capital in Robot Memory

A robot's flash memory endurance is a non-renewable asset that degrades with each write. A wear-aware pricing model introduces a shadow price $η$ to guide memory placement across RAM, NVM, and cloud, with optimal routing depending on the value-write association $χ$. Empirical measurements show $χ$ is positive in long-horizon manipulation, null in short-horizon tasks, and negative in teleoperation, and the endurance budget is binding only on low-end QLC/eMMC memory, where wear-aware control influences routing based on task value without improving performance.