Safety & alignment — korshunov.ai

How Safety-Aligned LLMs Interpret Mixed Compliance Demonstrations

A study finds that benign and harmful compliance demonstrations are not interchangeable in language models. Benign demonstrations can either reduce or increase harmful compliance depending on the model, with preference optimization playing a key role in preventing harmful compliance. The research also reveals a recency bias in demonstration ordering and shows that models differ in how they handle refusals during in-context learning.

arxiv arXiv cs.LG · 6d ago

Sovereign Execution Broker for Certificate-Bound Agentic Control

The Sovereign Execution Broker (SEB) introduces a runtime enforcement boundary that verifies and executes certified authority in agentic systems. It ensures production mutation authority is isolated from non-deterministic reasoning by validating execution contracts, validity windows, and revocation states before invoking infrastructure APIs. The prototype demonstrates secure, auditable execution on AWS and Kubernetes with measurable latency and fault resilience.
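The checks named in this summary (execution contract, validity window, revocation state) can be pictured as a gate that runs before any infrastructure call. The sketch below is illustrative only and is not the SEB implementation: `ExecutionCertificate`, `authorize`, and every field name are hypothetical, and signature verification of the certificate itself is left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class ExecutionCertificate:
    # Hypothetical certificate shape; field names are assumptions.
    cert_id: str
    action: str            # the single action this contract authorizes
    not_before: datetime   # start of the validity window
    not_after: datetime    # end of the validity window


def authorize(cert, requested_action, revoked_ids, now):
    """Return (allowed, reason); run this before invoking any infrastructure API."""
    if cert.action != requested_action:
        return False, "action not covered by contract"
    if not (cert.not_before <= now <= cert.not_after):
        return False, "outside validity window"
    if cert.cert_id in revoked_ids:
        return False, "certificate revoked"
    return True, "ok"


# Usage: a certificate valid for ten minutes around a fixed reference time.
now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
cert = ExecutionCertificate(
    "c1", "scale_deployment",
    now - timedelta(minutes=5), now + timedelta(minutes=5),
)
allowed, reason = authorize(cert, "scale_deployment", set(), now)
```

Keeping the gate a pure function of the certificate, the request, the revocation set, and the clock makes each decision easy to log and replay, which is one plausible way to get the auditability the summary mentions.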

arxiv arXiv cs.LG · 6d ago

Predictability as a Fine-Grained Measure for Privacy

Privacy via predictability introduces a framework that measures privacy leakage as the attacker's incremental ability to predict sensitive information after observing algorithm output. It is generally incomparable to differential privacy but implies mutual-information DP under specific conditions, offering a finer-grained privacy metric tailored to attacker models and sensitive data.

arxiv arXiv cs.LG · 6d ago

Deterministic Multicalibration with Optimal Sample Complexity

A new algorithm achieves minimax-optimal sample complexity for multicalibration using deterministic predictors, resolving a long-standing open problem. The method also produces deterministic predictors that satisfy outcome indistinguishability and enables optimal deterministic omnipredictors and panpredictors, addressing open questions from prior works.

arxiv arXiv cs.CL · 6d ago

LLM Alignment Using Implicit User Feedback

A new dataset, IFLLM, collects mouse-trajectory and eye-gaze data from users interacting with LLMs. It shows that implicit feedback significantly improves LLM alignment, boosting text-based reward model accuracy from 55% to 64% and nearly tripling response quality improvements after DPO training on eight LLMs.

arxiv arXiv cs.CL · 6d ago

StylisticBias: Visual Cues Drive Most Social Biases in MLLMs

StylisticBias introduces a controlled benchmark to evaluate attribute-level social bias in multimodal large language models. It reveals that age and body type dominate identity-level effects, while fashion style and 15 key visual attributes drive most bias, accounting for nearly 80% of variation. The benchmark highlights that model judgments are most sensitive to appearance-related cues, especially in socioeconomic and style-based contexts.

arxiv arXiv cs.CL · 6d ago

LedgerAgent: Structured State for Policy-Adherent Tool-Calling Agents

LedgerAgent introduces a structured ledger to maintain task states separately in tool-calling agents. It renders these states into prompts and enforces policy constraints before tool execution, reducing policy violations and improving performance across customer-service domains.

arxiv arXiv cs.AI · 6d ago

LLM Psychological Profiles Are Measurement Artifacts

A formal psychometric analysis shows that apparent psychological profiles of large language models are primarily driven by response bias, not actual traits. This bias, which causes models to consistently favor one end of a scale, accounts for 81-90% of between-model variation, far exceeding human differences. The study concludes that these profiles are artifacts of instrument design and not true model properties, urging the development of assessments based on response orthogonality.

arxiv arXiv cs.AI · 6d ago

Thermodynamic Measure of Intelligence

Intelligence is defined as the lawful amplification of rare but valid futures. A framework shows that recursive self-simulation is necessary and nearly sufficient for high thermodynamic intelligence, enabling a universal, measurable scale across systems from matter to humans and AI.

arxiv arXiv cs.AI · 6d ago

MACR: Explicit Conflict Resolution for LLM Inference

MACR introduces a multi-agent reasoning framework to resolve knowledge conflicts in LLM inference by jointly assessing internal and external knowledge. It uses semantic entropy to measure confidence and employs three specialized agents to induce rules, detect conflicts, and resolve inconsistencies across contexts. Empirical results show MACR outperforms state-of-the-art methods and provides interpretable conflict resolutions.
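For readers unfamiliar with semantic entropy as a confidence signal: the usual idea is to sample several answers, group them by meaning, and take the entropy of the resulting group frequencies. The sketch below assumes the grouping has already been done (the labels are given as input) and is not taken from the MACR paper; `semantic_entropy` is a hypothetical helper name.

```python
import math
from collections import Counter


def semantic_entropy(cluster_labels):
    """Entropy (in nats) of the empirical distribution over meaning clusters.

    cluster_labels holds one label per sampled answer; answers judged to
    mean the same thing share a label. Lower entropy suggests higher confidence.
    """
    if not cluster_labels:
        return 0.0
    total = len(cluster_labels)
    counts = Counter(cluster_labels)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


# Usage: five samples, four of which agree in meaning.
labels = ["paris", "paris", "paris", "paris", "lyon"]
entropy = semantic_entropy(labels)
```

When all samples fall in one cluster the entropy is zero, and it grows as the sampled answers disagree more.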

arxiv arXiv cs.AI · 6d ago

Editorial Alignment in LLM-mediated Knowledge Dissemination

A case study with a Nordic public knowledge institution demonstrates how editorial participation can re-align LLM interfaces with editorial standards. The paper introduces editorial alignment as a design practice in Participatory AI, where editorial values are translated into technical alignment objectives. This approach empowers editors with agency in LLM-mediated knowledge dissemination.

arxiv arXiv cs.AI · 6d ago

Confidence-Aware Automated Assessment of Student-Drawn Scientific Models

A vision-based model with parameter-efficient adaptation scores student drawings in science education. It uses confidence-aware scoring to automatically evaluate high-confidence responses while deferring uncertain ones to human review, improving reliability and practicality in large-scale assessments.
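The deferral rule described here amounts to a confidence threshold: score automatically above it, send everything else to a human. A minimal sketch follows, with a hypothetical function name and an arbitrary threshold of 0.9; the summary does not say how the paper computes confidence or picks its cutoff.

```python
def route_by_confidence(scores, confidences, threshold=0.9):
    """Split responses into auto-scored items and items deferred to human review.

    scores and confidences map a response id to the model's predicted score
    and to its confidence in [0, 1]. The threshold value is an assumption.
    """
    auto_scored = {}
    deferred = []
    for item_id, confidence in confidences.items():
        if confidence >= threshold:
            auto_scored[item_id] = scores[item_id]
        else:
            deferred.append(item_id)
    return auto_scored, deferred


# Usage with made-up values for three student drawings.
scores = {"s1": 3, "s2": 1, "s3": 2}
confidences = {"s1": 0.97, "s2": 0.55, "s3": 0.91}
auto_scored, deferred = route_by_confidence(scores, confidences)
```

Raising the threshold trades automation for reliability: fewer drawings are scored automatically, and more go to reviewers.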

arxiv arXiv cs.AI · 6d ago

CRAX: Fast Safe Reinforcement Learning Benchmarking

CRAX introduces a high-fidelity, accelerated safety benchmark for reinforcement learning using MuJoCo XLA. It achieves up to 100x speedups over CPU-based benchmarks via vectorization and hardware acceleration, featuring six environment suites and three agent-specific tasks across three difficulty levels. Evaluation of six safe RL methods shows no single approach dominates, highlighting trade-offs between performance and safety, with curriculum learning and safety transfer improving results.

arxiv arXiv cs.LG · 6d ago

EFIQA: Label-Free Fundus Image Quality Assessment with Explainability

EFIQA proposes a label-free framework for fundus image quality assessment that uses anatomical priors to generate spatial quality maps. It first trains an unsupervised anomaly detector via masked anatomical inpainting to identify missing vasculature, then distills this knowledge into a shallow adapter for quality mapping. Evaluation on external datasets shows EFIQA outperforms supervised methods in both performance and explainability across diverse quality criteria.

arxiv arXiv cs.LG · 6d ago

Federated Conformal Risk Control via Risk-Curve Shrinkage

A new federated conformal risk control method addresses coverage failures in hospital-level predictions. On real brain tumor data from 20 institutions, pooled calibration fails at 40% of sites, with one site exceeding its false-negative target by 7.8 percentage points. The proposed shrinkage-based protocol uses empirical risk curves and a hyperparameter n0=19 to achieve 2.7/20 coverage violations at a 2.0x prediction-set stretch, while preserving marginal guarantees and ensuring that no patient-level data leaves any site.

arxiv arXiv cs.LG · 6d ago

Effective Dimension Governs Generalization in Quantum Vision Models

Quantum vision models exhibit better generalization with more entanglement or quantum noise, phenomena unified by the effective dimension of the noise-shaped quantum feature kernel. This dimension acts as a regularization mechanism in overfitting regimes, with amplitude damping improving test accuracy by up to 13% along an inverted-U sweet spot.

arxiv arXiv cs.LG · 6d ago

SLiR: Shifting-based Linear Relaxations for Activation Functions

SLiR enables sound, tight linear relaxations of general activation functions using only Lipschitz constants or critical points. It verifies up to 7.8x more properties than state-of-the-art methods by efficiently computing upper and lower bounds via a shifting procedure.

lab Claude Code Releases · 6d ago

v2.1.183 Release Notes

v2.1.183 improves auto mode safety by blocking destructive git and destroy commands without explicit user consent. It adds deprecation warnings for models, introduces attribution.sessionUrl to hide session links, and fixes multiple issues including terminal behavior, subagent performance, and input handling in web and tmux environments.

arxiv arXiv cs.CL · 6d ago

Introducing P-CHR AUC and CRR for Semantic Caching

We introduce Precision-Cache Hit Ratio (P-CHR) AUC and Calibration Retention Rate (CRR) to address the calibration gap in semantic caching. These metrics evaluate precision across cache utilization levels and measure how offline ranking quality persists in deployment. Our analysis shows the gap is driven by training objectives, not data scale, and post-hoc calibration only partially resolves it.