Safety & alignment — korshunov.ai

How Safety-Aligned LLMs Interpret Mixed Compliance Demonstrations

A study finds that benign and harmful compliance demonstrations are not interchangeable in LLMs. Benign demonstrations can either reduce or increase harmful compliance depending on the model, and preference optimization plays a key role in preventing harmful compliance. Demonstration ordering shows a strong recency bias, and models vary in how they handle refusals during in-context learning.

arXiv cs.AI · 6d ago

Efficient and Sound Probabilistic Verification for AI Agents

A new framework enables secure, probabilistic policy enforcement for AI agents in ambiguous environments. It uses distributionally robust optimization to compute rigorous upper bounds on policy violation probabilities without assuming predicate independence. The method outperforms prior approaches on terminal and tool calling agent benchmarks, improving the security-utility trade-off.

arXiv cs.AI · 6d ago

Sovereign Execution Broker for Certificate-Bound Agentic Control

The Sovereign Execution Broker (SEB) introduces a runtime enforcement boundary that verifies and executes certified authority in agentic systems. It validates execution contracts, checks validity periods, and ensures policy compliance before invoking infrastructure APIs, providing a short-lived, auditable, and revocable execution capability. The prototype was evaluated on AWS and Kubernetes, measuring latency, revocation propagation, and fault injection resistance.
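
The pre-execution checks named in this summary (contract validity period, policy compliance, revocation) can be illustrated with a minimal Python sketch. It is not the SEB prototype: `ExecutionContract`, `authorize`, `execute`, and every field name are hypothetical, and a real broker would also verify the certificate signature.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class ExecutionContract:
    """Hypothetical contract shape; the paper's actual fields are not given here."""
    contract_id: str
    action: str            # illustrative action name, e.g. "k8s:scale-deployment"
    not_before: datetime   # start of the validity period
    not_after: datetime    # end of the validity period


def authorize(contract, allowed_actions, revoked_ids, now):
    """Return (ok, reason); every check must pass before any API call."""
    if contract.contract_id in revoked_ids:
        return False, "revoked"
    if not (contract.not_before <= now <= contract.not_after):
        return False, "outside validity period"
    if contract.action not in allowed_actions:
        return False, "action not permitted by policy"
    return True, "ok"


def execute(contract, allowed_actions, revoked_ids, invoke, now=None):
    """Call `invoke` (a stand-in for an infrastructure API) only if authorized."""
    now = now or datetime.now(timezone.utc)
    ok, reason = authorize(contract, allowed_actions, revoked_ids, now)
    if not ok:
        raise PermissionError(reason)
    return invoke(contract.action)
```

Keeping `authorize` free of side effects means the decision can be logged and audited separately from the call it guards.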

arXiv cs.AI · 6d ago

LedgerAgent: Structured State for Policy-Adherent Tool-Calling Agents

LedgerAgent introduces a structured ledger to maintain task states separately in tool-calling agents. It renders states into prompts and enforces policy constraints before tool execution, reducing policy violations and improving performance across customer-service domains.
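
The pattern described here (task state held in a structured ledger, rendered into the prompt, and checked against policy before a tool runs) can be sketched as follows. This is an illustration under assumptions, not LedgerAgent's implementation: `Ledger`, `refund_policy`, `call_tool`, and the refund cap of 100 are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Ledger:
    """Hypothetical task state kept outside the chat transcript."""
    facts: dict = field(default_factory=dict)

    def render(self) -> str:
        """Render the state as a prompt section, one sorted key per line."""
        lines = [f"- {key}: {value}" for key, value in sorted(self.facts.items())]
        return "TASK STATE\n" + "\n".join(lines)


def refund_policy(ledger, args):
    """Illustrative rule: refunds need a verified user and at most 100 units."""
    if not ledger.facts.get("identity_verified", False):
        return "identity not verified"
    if args.get("amount", 0) > 100:
        return "amount exceeds refund cap"
    return None  # no violation


def call_tool(ledger, tool, args, policies, tools):
    """Check the tool's policy against the ledger before executing the tool."""
    check = policies.get(tool)
    violation = check(ledger, args) if check else None
    if violation is not None:
        return {"status": "blocked", "reason": violation}
    result = tools[tool](**args)
    ledger.facts[f"last_{tool}"] = result  # record the outcome in the ledger
    return {"status": "ok", "result": result}
```

Because the policy check reads the ledger rather than the conversation text, a blocked call cannot be talked around by rephrasing the request.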

arXiv cs.LG · 6d ago

Lightweight Defense Against False Data Injection in Power Grids

A new defense framework makes deep neural networks more resilient to false data injection attacks in power grids by adding a padding layer of pseudofeatures derived from the statistical distribution of the inputs. The lightweight, model-agnostic approach increases input dimensionality in a randomized, data-aware way, which makes adversarial perturbations non-transferable and unpredictable and counters attacks without degrading performance.

arXiv cs.LG · 6d ago

Bias Mitigation under Coverage Constraints and the Price of Fairness

A new framework addresses data bias in machine learning by incorporating coverage constraints to ensure sufficient representation of intersectional subgroups. It trades small bias errors for greater data efficiency and formulates bias mitigation as an integer linear program, characterizing the price of fairness as a function of fairness tolerance to guide data governance and legal compliance.

arXiv cs.LG · 6d ago

Riemannian Sharpness Explains SGD's Bias Toward Flat Minima

This study introduces Riemannian sharpness, a reparametrization-invariant measure of flatness grounded in Fisher Information Matrix geometry. It proves SGD's stationary distribution concentrates at Riemannian-flat minima and links this geometric bias to generalization via a PAC-Bayes bound. Experiments on MNIST and CIFAR-10 show Riemannian sharpness better tracks generalization than Euclidean sharpness, with scaling consistent with theory.

arXiv cs.LG · 6d ago

LLM Alignment Using Implicit User Feedback

A new dataset, IFLLM, collects mouse-trajectory and eye-gaze data from users interacting with LLMs. Implicit feedback significantly improves LLM alignment, raising text-based reward model accuracy from 55% to 64% and nearly tripling response-quality improvements after DPO training on eight LLMs.

arXiv cs.LG · 6d ago

Evaluator Bias Propagation in Multi-Agent LLM Systems

Contagion Networks is a framework for measuring how evaluator biases spread among LLM agents. In a 3-agent experiment, biases propagate with coefficients between 0.157 and 0.352, and setups in which all agents share one model show significantly weaker contagion than cross-model setups. Increasing the evaluator committee size from k=1 to k=3 reduces effective contagion by 72.4%.

arXiv cs.LG · 6d ago

How Safety-Aligned LLMs Interpret Mixed Compliance Demonstrations

A study finds benign and harmful compliance demonstrations are not interchangeable in language models. Benign demonstrations can either reduce or increase harmful compliance depending on the model, with preference optimization playing a key role in preventing harmful compliance. The research also reveals recency bias in demonstration ordering and varied model behaviors in handling refusals during in-context learning.

arXiv cs.LG · 6d ago

Sovereign Execution Broker for Certificate-Bound Agentic Control

The Sovereign Execution Broker (SEB) introduces a runtime enforcement boundary that verifies and executes certified authority in agentic systems. It ensures production mutation authority is isolated from non-deterministic reasoning by validating execution contracts, validity windows, and revocation states before invoking infrastructure APIs. The prototype demonstrates secure, auditable execution on AWS and Kubernetes with measurable latency and fault resilience.

arXiv cs.LG · 6d ago

Predictability as a Fine-Grained Measure for Privacy

Privacy via predictability introduces a framework that measures privacy leakage as the attacker's incremental ability to predict sensitive information after observing algorithm output. It is generally incomparable to differential privacy but implies mutual-information DP under specific conditions, offering a finer-grained privacy metric tailored to attacker models and sensitive data.

arXiv cs.LG · 6d ago

Deterministic Multicalibration with Optimal Sample Complexity

A new algorithm achieves minimax-optimal sample complexity for multicalibration using deterministic predictors, resolving a long-standing open problem. The method also produces deterministic predictors that satisfy outcome indistinguishability and enables optimal deterministic omnipredictors and panpredictors, addressing open questions from prior works.

arXiv cs.CL · 6d ago

LLM Alignment Using Implicit User Feedback

arXiv cs.CL · 6d ago

StylisticBias: Visual Cues Drive Most Social Biases in MLLMs

StylisticBias introduces a controlled benchmark to evaluate attribute-level social bias in multimodal large language models. It reveals that age and body type dominate identity-level effects, while fashion style and 15 key visual attributes drive most bias, accounting for nearly 80% of variation. The benchmark highlights that model judgments are most sensitive to appearance-related cues, especially in socioeconomic and style-based contexts.

arXiv cs.CL · 6d ago

LedgerAgent: Structured State for Policy-Adherent Tool-Calling Agents

LedgerAgent introduces a structured ledger to maintain task states separately in tool-calling agents. It renders these states into prompts and enforces policy constraints before tool execution, reducing policy violations and improving performance across customer-service domains.

arXiv cs.AI · 6d ago

LLM Psychological Profiles Are Measurement Artifacts

A formal psychometric analysis shows that apparent psychological profiles of large language models are primarily driven by response bias, not actual traits. This bias, which causes models to consistently favor one end of a scale, accounts for 81-90% of between-model variation, far exceeding human differences. The study concludes that these profiles are artifacts of instrument design and not true model properties, urging the development of assessments based on response orthogonality.

arXiv cs.AI · 6d ago

Thermodynamic Measure of Intelligence

Intelligence is defined as the lawful amplification of rare but valid futures. A framework shows that recursive self-simulation is necessary and nearly sufficient for high thermodynamic intelligence, enabling a universal, measurable scale across systems from matter to humans and AI.

arXiv cs.AI · 6d ago

MACR: Explicit Conflict Resolution for LLM Inference

MACR introduces a multi-agent reasoning framework to resolve knowledge conflicts in LLM inference by jointly assessing internal and external knowledge. It uses semantic entropy to measure confidence and employs three specialized agents to induce rules, detect conflicts, and resolve inconsistencies across contexts. Empirical results show MACR outperforms state-of-the-art methods and provides interpretable conflict resolutions.

arXiv cs.AI · 6d ago

Editorial Alignment in LLM-mediated Knowledge Dissemination

A case study with a Nordic public knowledge institution demonstrates how editorial participation can re-align LLM interfaces with editorial standards. The paper introduces editorial alignment as a design practice in Participatory AI, where editorial values are translated into technical alignment objectives. This approach empowers editors with agency in LLM-mediated knowledge dissemination.