Safety & alignment
arXiv cs.CL · 7d ago

Misfired Alignment in LLMs: A Quantitative Study

A new study introduces VETO, a benchmark of 2,032 BBQ-derived contrastive pairs, to quantify misfired alignment in large language models. It defines the Misfired Alignment Rate (MAR) and finds that all benchmarked LLMs exhibit MARs between 4.7% and 18.9%, while human participants score 0%. The study also reports that alignment cues can amplify these failures, and that evidence suppression occurs in the models' late layers and emerges after instruction training.

arXiv cs.CL · 7d ago

RedactionBench: A Benchmark for Contextual Privacy in AI

RedactionBench introduces a manually annotated benchmark of 200 diverse documents across 11 domains to evaluate privacy-preserving redaction. It features R-Score, a character-level metric that treats semantically similar redactions equally and reduces bias from formatting choices. Human evaluations reveal significant disagreement on contextual redactions (47.7% consensus), highlighting the subjective nature of privacy and motivating the need for standardized, context-aware benchmarks.

arXiv cs.CL · 7d ago

LLM-based Metrics Improve Clinical Significance Evaluation in Radiology

A study introduces lightweight, interpretable metrics that sharpen the boundary between clinically significant errors and harmless variations in radiology reports. These metrics outperform large medical LLMs and rival proprietary models, with one-pass training proven effective for cost-sensitive deployment. The two-pass setting fails to consistently improve performance and shifts focus from error detection to robustness.

arXiv cs.CL · 7d ago

Index Sickness Elimination via Baseline-Log Physical Separation

In a 391-session AI collaboration project, LLMs exhibited 'Index Sickness', a failure mode in which symbolic complexity leads to self-referential outputs disconnected from reality. The 'Pang Principle' asserts that natural language conveys higher semantic quality than symbolic systems. The 'Baseline-Log Physical Separation' mechanism reduced AI instruction volume by 75% and eliminated recurrence of Index Sickness in subsequent sessions.

arXiv cs.CL · 7d ago

Human-AI Coevolution Framework Reveals Social Intelligence Emergence

The Human-AI Coevolution Dynamics Framework (HACD-H) introduces a unified model for long-term human-AI interaction, integrating emotional adaptation, memory, and personality into a self-organizing social cognitive system. Results show social intelligence emerges through coevolution, with a significant negative correlation between social intelligence and social cognitive energy (r = -0.391, p < 0.001), and progressive energy reduction over time in interaction trajectories.

arXiv cs.AI · 7d ago

Towards an Agent-First Web: Redesigning the Web for AI Agents

A new paper proposes a fundamental redesign of the web to prioritize AI agent access, challenging the long-held assumption that humans are the primary web users. It introduces reforms at the access, economic, and content layers, including agent-identifiable HTTP headers, intent-based subscription models, and a cryptographic provenance system, to make AI agents first-class participants, with human supervision and accountability embedded in the architecture.
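
To make the idea of agent-identifiable HTTP headers concrete, here is a minimal Python sketch of a request that declares itself as an agent. The summary does not say which header names the paper proposes, so `X-Agent-Operator`, the `demo-agent` name, and the `build_agent_request` helper are hypothetical and chosen only for illustration; `User-Agent` is the standard HTTP header. The request is built but never sent.

```python
from urllib.request import Request


def build_agent_request(url: str, agent_name: str, operator_contact: str) -> Request:
    """Build (but do not send) a GET request that identifies itself as an AI agent."""
    headers = {
        # Standard header: names the software making the request.
        "User-Agent": f"{agent_name}/0.1",
        # Hypothetical header (an assumption, not taken from the paper):
        # tells the site which human or organization is accountable for the agent.
        "X-Agent-Operator": operator_contact,
    }
    return Request(url, headers=headers, method="GET")


req = build_agent_request(
    "https://example.com/articles", "demo-agent", "ops@example.com"
)

# Inspect the headers that would be sent; urllib normalizes header-name casing.
for name, value in req.header_items():
    print(f"{name}: {value}")
```

A server could branch on such headers to apply agent-specific access rules, which is roughly the role the summary assigns to the access layer. How the paper ties these headers to its subscription and provenance mechanisms is not stated here.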