Evaluation & benchmarks
arXiv cs.AI · 8d ago

Visual Verification Enables Inference-time Steering and Autonomous Policy Improvement

VERITAS introduces a generator-verifier framework that lets robots improve their policies at inference time without additional training. A visual verifier evaluates actions as they are generated, which yields consistent performance gains, and the verified rollouts then serve as effective supervision for offline policy improvement. Post-training with these verified rollouts matches expert demonstrations in efficiency, without human intervention.
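As a rough illustration of verifier-guided steering, the sketch below picks the highest-scoring action from a set of candidates. It is a generic best-of-N selection under stated assumptions, not the VERITAS implementation: `select_verified_action` and the `verifier_score` callable are hypothetical names, and the summary does not say how the paper's visual verifier scores or samples actions.

```python
from typing import Callable, Sequence, TypeVar

A = TypeVar("A")


def select_verified_action(
    candidates: Sequence[A],
    verifier_score: Callable[[A], float],
) -> A:
    """Return the candidate action the verifier scores highest.

    `verifier_score` is a stand-in (assumption) for a learned visual
    verifier; here it is any function mapping an action to a float.
    """
    if not candidates:
        raise ValueError("need at least one candidate action")
    # Best-of-N: the generator proposes, the verifier ranks.
    return max(candidates, key=verifier_score)


if __name__ == "__main__":
    # Toy example: actions are scalars, the "verifier" prefers values near 0.5.
    proposals = [0.1, 0.9, 0.5]
    best = select_verified_action(proposals, lambda a: -abs(a - 0.5))
    print(best)
```

In a setup like this, the selected actions and their outcomes could be logged as rollouts for later offline training, which is the role the summary attributes to verified rollouts.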

arXiv cs.CL · 8d ago

Visuals Lie, Consistency Speaks: Disentangling Spatial Attention from Reliability in Vision-Language Models

A study challenges the assumption that visual attention signals reliability in vision-language models. It finds near-zero correlation between spatial attention and accuracy, showing instead that self-consistency across reasoning paths is a stronger predictor of truth. Reliability is better explained by generation dynamics and internal state distributions, not visual attention patterns.
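One common way to measure self-consistency is the share of sampled reasoning paths that agree on the majority answer. The sketch below assumes that definition; the paper's exact metric is not given in this summary, and `self_consistency` is a hypothetical helper name.

```python
from collections import Counter
from typing import Hashable, Sequence, Tuple


def self_consistency(answers: Sequence[Hashable]) -> Tuple[Hashable, float]:
    """Return the majority answer and the fraction of paths that agree with it.

    `answers` holds the final answer from each independently sampled
    reasoning path (assumption: answers are already normalized strings
    or other hashable values).
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    majority, votes = counts.most_common(1)[0]
    return majority, votes / len(answers)


if __name__ == "__main__":
    # Four sampled paths, three of which agree.
    print(self_consistency(["A", "B", "A", "A"]))
```

A higher agreement fraction would then be read as a signal of reliability, independent of where the model's visual attention falls.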

arXiv cs.CL · 8d ago

NarrativeWorldBench and N-VSSM for Long-Horizon Audio Drama

NarrativeWorldBench evaluates 21 LLMs on nine narrative-structure metrics across horizons of 10 to 200 episodes, with cross-lingual support in Hindi, Tamil, Telugu, and Marathi. N-VSSM, a latent world model using Mamba-2, achieves a plot-beat F1 of at least 0.84 across all horizons at 4x lower compute than closed frontier models. In a professional writer study, it outperforms Claude Opus 4.5 in long-arc consistency and controllability.

arXiv cs.CL · 8d ago

LLM Recommendation Bias and Brand Competition Dynamics

Well-known brands capture 100% of LLM recommendations when products are identical, but this advantage vanishes when a competitor has a rating edge of just +0.1 stars. Authority-style marketing claims, such as fabricated clinical evidence, break this dominance at a bias surplus of +0.17 rating points, though the response varies by model. In multi-brand competition a social dilemma emerges: collective optimization reduces the individual payoff from +0.802 to +0.007 and eliminates recommendations for non-participating brands.

arXiv cs.CL · 8d ago

AIPatient Arena: EHR-grounded evaluation of LLMs in clinical workflows

AIPatient Arena evaluates large language models in end-to-end clinical consultations using EHR-grounded patient-specific knowledge graphs. It assesses LLMs across eight clinical competence dimensions, revealing strong performance in interview skills, ethics, and explanation clarity, but persistent weaknesses in handling ambiguity, information coverage, and diagnostic reasoning, with process failures like repetitive questioning and omitted history.

arXiv cs.CL · 8d ago

Second-Order Bias in LLMs: Evaluating Judgment-Based Bias

A new study identifies second-order bias in large language models: social bias in their judgments about biased content. Drawing on entitlement epistemology, the research develops a reasoning task to assess whether LLMs accept or reject biased texts based on demographics, revealing implicit biases that vary by target group and evade safety guardrails. The work introduces two metrics to quantify these biases and calls for more theoretically grounded evaluation methods in NLP.

arXiv cs.CL · 8d ago

Expressivity Analysis of Hierarchical Modelling in Deep Transformers

This paper analyzes deep transformer expressiveness using bounded-depth grammars. It constructs transformers with positional attention where model depth scales linearly with grammar depth, and neuron count grows quadratically with production rules. The results support the linear representation hypothesis by showing these models can encode abstract grammatical states in low-dimensional, linearly separable subspaces.