Safety & alignment
arxiv arXiv cs.AI · 6d ago

Sovereign Execution Broker for Certificate-Bound Agentic Control

The Sovereign Execution Broker (SEB) introduces a runtime enforcement boundary that verifies and executes certified authority in agentic systems. It validates execution contracts, checks validity periods, and ensures policy compliance before invoking infrastructure APIs, providing a short-lived, auditable, and revocable execution capability. The prototype was evaluated on AWS and Kubernetes, measuring latency, revocation propagation, and fault injection resistance.

arxiv arXiv cs.LG · 6d ago

Lightweight Defense Against False Data Injection in Power Grids

A new defense framework enhances deep neural networks' resilience to false data injection attacks in power grids by adding a padding layer with pseudofeatures derived from input statistical distributions. This lightweight, model-agnostic approach increases input dimensionality in a randomized, data-aware way, making adversarial perturbations non-transferable and unpredictable, thus effectively countering attacks without performance degradation.

arxiv arXiv cs.LG · 6d ago

Riemannian Sharpness Explains SGD's Bias Toward Flat Minima

This study introduces Riemannian sharpness, a reparametrization-invariant measure of flatness grounded in Fisher Information Matrix geometry. It proves SGD's stationary distribution concentrates at Riemannian-flat minima and links this geometric bias to generalization via a PAC-Bayes bound. Experiments on MNIST and CIFAR-10 show Riemannian sharpness better tracks generalization than Euclidean sharpness, with scaling consistent with theory.

arxiv arXiv cs.LG · 6d ago

How Safety-Aligned LLMs Interpret Mixed Compliance Demonstrations

A study finds benign and harmful compliance demonstrations are not interchangeable in language models. Benign demonstrations can either reduce or increase harmful compliance depending on the model, with preference optimization playing a key role in preventing harmful compliance. The research also reveals recency bias in demonstration ordering and varied model behaviors in handling refusals during in-context learning.

arxiv arXiv cs.LG · 6d ago

Sovereign Execution Broker for Certificate-Bound Agentic Control

The Sovereign Execution Broker (SEB) introduces a runtime enforcement boundary that verifies and executes certified authority in agentic systems. It ensures production mutation authority is isolated from non-deterministic reasoning by validating execution contracts, validity windows, and revocation states before invoking infrastructure APIs. The prototype demonstrates secure, auditable execution on AWS and Kubernetes with measurable latency and fault resilience.

arxiv arXiv cs.CL · 6d ago

StylisticBias: Visual Cues Drive Most Social Biases in MLLMs

StylisticBias introduces a controlled benchmark to evaluate attribute-level social bias in multimodal large language models. It reveals that age and body type dominate identity-level effects, while fashion style and 15 key visual attributes drive most bias, accounting for nearly 80% of variation. The benchmark highlights that model judgments are most sensitive to appearance-related cues, especially in socioeconomic and style-based contexts.

arxiv arXiv cs.AI · 6d ago

LLM Psychological Profiles Are Measurement Artifacts

A formal psychometric analysis shows that apparent psychological profiles of large language models are primarily driven by response bias, not actual traits. This bias, which causes models to consistently favor one end of a scale, accounts for 81-90% of between-model variation, far exceeding human differences. The study concludes that these profiles are artifacts of instrument design and not true model properties, urging the development of assessments based on response orthogonality.

arxiv arXiv cs.AI · 6d ago

MACR: Explicit Conflict Resolution for LLM Inference

MACR introduces a multi-agent reasoning framework to resolve knowledge conflicts in LLM inference by jointly assessing internal and external knowledge. It uses semantic entropy to measure confidence and employs three specialized agents to induce rules, detect conflicts, and resolve inconsistencies across contexts. Empirical results show MACR outperforms state-of-the-art methods and provides interpretable conflict resolutions.