Safety & alignment — korshunov.ai

Safety & alignment Page 1 / 10

Rubric-Guided Counterfactual Recommendations for Medical Communication

A new pipeline uses language models to recommend minimal, interpretable changes to patient-doctor communication features like tone and personalization. These changes increase predicted positive feedback by an average of 6.41% and are non-negative for 93.31% of cases, without altering medical content.

arxiv arXiv cs.CL · 7d ago

Speech-Based Dementia Assessment with Error Mitigation

This study improves accuracy in dementia screening by using speech-derived features from the German Syndrom-Kurz-Test. Models combine transcript scores and Whisper embeddings to reduce scoring errors and approximate expert ratings by compensating for missing motor subtests. The approach achieves strong correlation with expert ratings and effectively distinguishes cognitive status groups.

arxiv arXiv cs.CL · 7d ago

Index Sickness Elimination via Baseline-Log Physical Separation

In a 391-session AI collaboration project, LLMs exhibited 'Index Sickness'—a failure where symbolic complexity leads to self-referential outputs disconnected from reality. The 'Pang Principle' asserts natural language conveys superior semantic quality over symbolic systems, and the 'Baseline-Log Physical Separation' mechanism reduced AI instruction volume by 75% and eliminated recurrence of Index Sickness in subsequent sessions.

arxiv arXiv cs.CL · 7d ago

Human-AI Coevolution Framework Reveals Social Intelligence Emergence

The Human-AI Coevolution Dynamics Framework (HACD-H) introduces a unified model for long-term human-AI interaction, integrating emotional adaptation, memory, and personality into a self-organizing social cognitive system. Results show social intelligence emerges through coevolution, with a significant negative correlation between social intelligence and social cognitive energy (r = -0.391, p < 0.001), and progressive energy reduction over time in interaction trajectories.

arxiv arXiv cs.AI · 7d ago

TRUST: Target-Confidence Recourse with tSeTlin Machines

TRUST enables users to specify desired prediction confidence when generating counterfactual explanations. By directly optimizing for confidence targets using a Probabilistic Tsetlin Machine and Bayesian optimization, TRUST produces more robust and interpretable recourse than traditional boundary-based methods, achieving perfect robustness with low cost and high confidence on real-world datasets.

arxiv arXiv cs.AI · 7d ago

ImpSH Improves Implicit Hate Speech Detection Across Domains

ImpSH, a triplet-based framework, aligns posts with implied statements and uses context-bounded semi-hard negatives to enhance detection of implicit hate speech. Evaluated on IHC, SBIC, and DynaHate with BERT and HateBERT, ImpSH outperforms standard supervised contrastive methods in cross-domain settings, showing improved generalizability and stability.

arxiv arXiv cs.AI · 7d ago

Scaling AEB with Massive Unlabeled Data via Meta-Feedback SSL

A meta-feedback semi-supervised learning framework enables scaling of automatic emergency braking using massive unlabeled fleet data. The stabilized approach reduces pseudo-label errors through noise-aware decoupling and kinematics-gated pseudo-labeling, improving safety with a 100:1 positive-to-false activation ratio and 35% more accident-free driving mileage compared to rule-based systems.

arxiv arXiv cs.AI · 7d ago

SciRisk-Bench: A Risk-Dimension-Aware Benchmark for AI4Science Safety

SciRisk-Bench introduces a benchmark to evaluate AI4Science safety by assessing models across 7 disciplines, 31 subdisciplines, and 10 risk dimensions. It evaluates both mainstream and science-oriented LLMs to identify specific gaps in risk recognition and avoidance within high-stakes scientific contexts.

arxiv arXiv cs.AI · 7d ago

TRAP: Benchmark for Task-completion and Resistance to Active Privacy-extraction

TRAP evaluates how well models complete tasks using private data without leaking it. Across 22 models, all show non-trivial privacy leakage, with instruction-following ability linked to higher leakage. Structural private field isolation prevents leakage by replacing private fields with hash keys, maintaining task accuracy without sacrificing privacy.

arxiv arXiv cs.AI · 7d ago

Towards an Agent-First Web: Redesigning the Web for AI Agents

A new paper proposes a fundamental redesign of the web to prioritize AI agent access, challenging the long-held assumption that humans are the primary web users. It introduces access, economic, and content layer reforms—including agent-identifiable HTTP headers, intent-based subscription models, and a cryptographic provenance system—to enable AI agents as first-class participants, with human supervision and accountability embedded in the architecture.

arxiv arXiv cs.AI · 7d ago

XAI reveals key drivers in European electricity markets

A study uses SHAP and SSHAP techniques to analyze electricity price drivers in 39 European bidding zones. It finds solar energy has a disproportionate impact on prices, gas remains a dominant factor, and interconnections highlight regional interdependence. The research also builds a synthetic EU-wide market to examine a fully integrated scenario.

arxiv arXiv cs.AI · 7d ago

Human-AI Coevolution Framework Reveals Social Intelligence Emergence

The Human-AI Coevolution Dynamics Framework (HACD-H) introduces a unified model for long-term human-AI interaction, integrating emotional adaptation, memory, and personality into a self-organizing system. Results show social intelligence emerges through coevolution, with a significant negative correlation between social intelligence and social cognitive energy (r = -0.391, p < 0.001), and progressive energy reduction over time.

media Don't Worry About the Vase · 7d ago

No Jailbreak: Fable's 'Fix This Code' Was a Fake Scenario

The article confirms there was no actual jailbreak of Anthropic's Fable AI. Instead, a test involving fake code with planted vulnerabilities was conducted, where Fable refused to review the code and only responded to a request to 'fix this code' after manual steps. Katie Moussouris of Luta Security states this scenario should not trigger export controls, calling it a deliberate, engineered test that undermines claims of a security breach.

media Interconnects · 7d ago

State of the Interconnects Blog Mid-2026

The author outlines three core goals: clarifying frontier AI model evolution, building an open AI ecosystem, and creating institutions to support these missions. Interconnects serves as a raw, independent voice for frontier AI thinking, with a dedicated technical audience of over 70K subscribers. The blog maintains paywalled comments to prevent AI-generated noise, and the author plans to reach 1000 paid subscribers by summer, emphasizing financial sustainability and independence amid rising AI service costs.

media r/LocalLLaMA · 8d ago

Rio 3.5 397B likely a failed embezzlement of AI funding

The Rio 3.5 397B AI model was reportedly developed by merging a Nex N2 Pro without additional training, using funds intended for proper model development. The official documentation initially claimed advanced training, but was later updated to admit the shallow merge, while still asserting additional training occurred, and the original model was removed from Hugging Face.

media r/LocalLLaMA · 8d ago

Elias in the Lighthouse: Diagnosing Low Diversity in LLM Stories

A new study examines the limited diversity in stories generated by large language models, using the recurring character Elias in the lighthouse as a case study. The research highlights how such patterns suggest systemic biases in training data and model outputs.

arxiv arXiv cs.LG · 8d ago

LegalHalluLens: Auditing Hallucinations in Legal AI

LegalHalluLens introduces a framework to audit AI hallucinations in legal contexts by analyzing typed hallucination profiles across four claim categories. It reveals a 38-40 point gap between obligation/numeric and temporal claims, and shows two systems with identical 52% hallucination rates can have opposite risk directions. The framework uses a Risk Direction Index and calibrated debate pipelines to reduce fabricated detections by 45%, offering actionable diagnostics for trustworthy legal AI deployment.

arxiv arXiv cs.LG · 8d ago

ScaFE: Using LLMs to Extract Clinically Meaningful Scar Features

ScaFE repositions large language models as feature engineers for scar classification, generating executable Python code from clinical criteria to extract interpretable features. The framework achieves superior performance with limited data, preserves privacy by processing images locally, and produces clinically grounded features aligned with established scoring systems like the Vancouver Scar Scale.

arxiv arXiv cs.LG · 8d ago

Edge Flow: A Continuous-Time Model for Gradient Descent at Edge of Stability

Edge Flow is a tractable, predictive continuous-time model that captures gradient descent dynamics at the edge of stability. It decomposes dynamics into center, oscillation direction, and magnitude, with self-stabilization of sharpness emerging from coupled feedback. The model requires only two gradient evaluations and one Hessian-vector product per iteration and outperforms prior models in tracking oscillations and explaining instabilities at EoS.

arxiv arXiv cs.LG · 8d ago

Handlebars Triple-Brace Injection Exploits Structural Role Delimiters

Handlebars' triple-brace interpolation fails to protect against structural role injection, as HTML escaping only neutralizes angle-bracket delimiters. It leaves colon and Markdown hash delimiters intact, enabling attackers to hijack model behavior. The default escaping provides no protection for most role delimiter schemes and cannot replace a clear separation of instructions and data.