Safety & alignment
arXiv cs.AI · 6d ago

MACR: Explicit Conflict Resolution for LLM Inference

MACR introduces a multi-agent reasoning framework to resolve knowledge conflicts in LLM inference by jointly assessing internal and external knowledge. It uses semantic entropy to measure confidence and employs three specialized agents to induce rules, detect conflicts, and resolve inconsistencies across contexts. Empirical results show MACR outperforms state-of-the-art methods and provides interpretable conflict resolutions.
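As a rough illustration of the confidence signal mentioned above, semantic entropy can be sketched as the Shannon entropy over meaning clusters of sampled answers. This is a minimal sketch, not MACR's implementation: the function name `semantic_entropy` is hypothetical, and the cluster labels are assumed to come from some separate equivalence check (for example, an entailment model) that is not shown here.

```python
import math
from collections import Counter


def semantic_entropy(cluster_ids):
    """Shannon entropy (in nats) over semantic clusters of sampled answers.

    cluster_ids: one label per sampled answer; answers judged to share a
    meaning carry the same label (the clustering step is assumed, not shown).
    """
    if not cluster_ids:
        raise ValueError("need at least one sampled answer")
    total = len(cluster_ids)
    counts = Counter(cluster_ids)
    # Sum p * log(1/p) over clusters; 0.0 means all samples agree.
    return sum((c / total) * math.log(total / c) for c in counts.values())


# All samples agree: low entropy, high confidence.
agree = semantic_entropy(["paris", "paris", "paris", "paris"])
# Samples split evenly across two meanings: higher entropy.
split = semantic_entropy(["paris", "lyon", "paris", "lyon"])
```

Lower values indicate that the sampled answers agree in meaning, which is the sense in which entropy can stand in for confidence.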

arXiv cs.AI · 6d ago

CRAX: Fast Safe Reinforcement Learning Benchmarking

CRAX introduces a high-fidelity, accelerated safety benchmark for reinforcement learning using MuJoCo XLA. It achieves up to 100x speedups over CPU-based benchmarks via vectorization and hardware acceleration, featuring six environment suites and three agent-specific tasks across three difficulty levels. Evaluation of six safe RL methods shows no single approach dominates, highlighting trade-offs between performance and safety, with curriculum learning and safety transfer improving results.

arXiv cs.LG · 6d ago

EFIQA: Label-Free Fundus Image Quality Assessment with Explainability

EFIQA proposes a label-free framework for fundus image quality assessment that uses anatomical priors to generate spatial quality maps. It first trains an unsupervised anomaly detector via masked anatomical inpainting to identify missing vasculature, then distills this knowledge into a shallow adapter for quality mapping. Evaluation on external datasets shows EFIQA outperforms supervised methods in both performance and explainability across diverse quality criteria.

arXiv cs.LG · 6d ago

Federated Conformal Risk Control via Risk-Curve Shrinkage

A new federated conformal risk control method addresses coverage failures in hospital-level predictions. On real brain tumor data from 20 institutions, pooled calibration fails at 40% of sites, with one site exceeding its false-negative target by 7.8 percentage points. The proposed shrinkage-based protocol uses empirical risk curves and a hyperparameter n0=19 to achieve 2.7/20 coverage violations at a 2.0x prediction-set stretch, while preserving marginal guarantees and ensuring that no patient-level data leaves any site.
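To make the ingredients concrete, the sketch below shows the standard (non-federated) conformal risk control threshold rule, plus one plausible reading of "risk-curve shrinkage": blending a site's empirical risk curve with the pooled curve using n0 as a pseudo-count. The blending formula and the names `crc_threshold` and `shrink_curve` are assumptions for illustration; the summary does not specify the paper's actual protocol.

```python
def crc_threshold(lambdas, risks, n, alpha, loss_bound=1.0):
    """Standard conformal risk control threshold selection.

    lambdas: candidate thresholds in increasing order.
    risks:   empirical risk at each lambda, assumed non-increasing in lambda.
    n:       number of calibration examples behind the risk estimates.
    Returns the smallest lambda with (n * risk + loss_bound) / (n + 1) <= alpha,
    or None if no candidate qualifies.
    """
    for lam, risk in zip(lambdas, risks):
        if (n * risk + loss_bound) / (n + 1) <= alpha:
            return lam
    return None


def shrink_curve(site_risks, pooled_risks, n_site, n0):
    """ASSUMED shrinkage form: weight the site curve by n_site / (n_site + n0)."""
    w = n_site / (n_site + n0)
    return [w * s + (1 - w) * p for s, p in zip(site_risks, pooled_risks)]


lambdas = [0.0, 0.5, 1.0]
pooled = [0.30, 0.10, 0.00]   # made-up pooled risk curve
site = [0.50, 0.20, 0.00]     # made-up risk curve for a small site
blended = shrink_curve(site, pooled, n_site=19, n0=19)
chosen = crc_threshold(lambdas, blended, n=19, alpha=0.1)
```

Only curve-level summaries are exchanged in this sketch, which is consistent with the claim that patient-level data stays on site, though the real protocol may differ.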

arXiv cs.CL · 6d ago

Sequential DPO Shows Variable Preference Impact Across Settings

A study of sequential Direct Preference Optimization finds that later training does not uniformly degrade earlier learned preferences. The effect varies by objective relationship, signal strength, and training order, ranging from partial degradation to positive transfer. Pair-level analysis reveals heterogeneous changes, with high-confidence preference pairs sometimes improving despite aggregate metric stability.
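For readers unfamiliar with the objective, the pair-level quantity behind such an analysis can be sketched from the standard DPO loss, which is the negative log-sigmoid of a scaled log-ratio margin between the chosen and rejected responses. This is a minimal sketch: the function names and the default `beta=0.1` are illustrative assumptions, and the study's own metrics may be defined differently.

```python
import math


def dpo_margin(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Scaled implicit-reward margin for one preference pair.

    Arguments are sequence log-probabilities under the policy and the
    frozen reference model; beta is the DPO temperature (value assumed).
    """
    return beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))


def dpo_loss(margin):
    """-log(sigmoid(margin)), computed in a numerically stable way."""
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# A pair where the policy prefers the chosen response more than the reference does.
m = dpo_margin(policy_chosen=-10.0, policy_rejected=-14.0,
               ref_chosen=-12.0, ref_rejected=-12.0)
loss = dpo_loss(m)
```

Tracking the margin for each pair before and after a second training stage is one way to see heterogeneous changes that an aggregate metric would average out.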

arXiv cs.CL · 6d ago

Control-Window Law for Single-Neuron Steering in Language Models

A new framework defines when single-neuron interventions coherently control model behaviors without output collapse. The control window, based on alignment and norm ratios, predicts behavior triggers and collapse ceilings from forward-pass data, with high accuracy on held-out neurons. For refusal, control falls into distinct types: coherent bypass occurs without actionable content, while genuine actionable reach appears only in specific cases and at later rollout stages.

arXiv cs.CL · 6d ago

AI-Driven Deliberation: Scaling Inclusivity and Empowering Marginalised Groups

Large Language Models can scale democratic deliberation by scaffolding argumentation and reducing linguistic biases. The chapter uses Systemic-Functional Linguistics to analyze how socio-demographic and communicative variations affect participation, highlighting AI's potential to challenge exclusionary norms while cautioning against over- or under-claiming its capabilities. It calls for ethical safeguards and further research to ensure equitable AI-assisted engagement.

arXiv cs.CL · 6d ago

REDACT: Multilingual PII Benchmark with Systematic Control

REDACT introduces a systematically controlled multilingual benchmark for personally identifiable information detection, featuring 51 entity types, 4,127 surface-form patterns, and 25 languages. It evaluates five detectors across 1,000 records, revealing that rule-based models fail on high-stakes data while LLMs perform better, especially in high-sensitivity categories. A reference-free LLM assessment confirms sensitivity-tier assignment as the most challenging evaluation axis.