Research paper — korshunov.ai

Topic · Research paper

Routing Accuracy Degradation and Recovery in Enterprise Agent Systems

As enterprise agent tool catalogs scale from 10 to 110 agents, routing accuracy drops 16-23 percentage points on under-specified requests. An oracle analysis identifies retrieval and confusion gaps, with embedding-based shortlisting recovering +10-11pp F1. A human-annotated study of 1,435 utterances confirms real-world recovery of +10-17pp despite lower absolute performance.
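
As a rough illustration of embedding-based shortlisting, the sketch below ranks agents by cosine similarity between a request embedding and each agent's description embedding, then keeps the top k for the router to choose from. The function name `shortlist_agents`, the toy vectors, and the choice of cosine similarity are assumptions for illustration; the summary does not say which embedding model or scoring rule the paper uses.

```python
import numpy as np

def shortlist_agents(query_vec, agent_vecs, k=3):
    """Return (indices, scores) of the k agents most cosine-similar to the query.

    query_vec: 1-D embedding of the user request (hypothetical input).
    agent_vecs: 2-D array, one row per agent description embedding.
    """
    q = query_vec / np.linalg.norm(query_vec)
    a = agent_vecs / np.linalg.norm(agent_vecs, axis=1, keepdims=True)
    scores = a @ q                      # cosine similarity per agent
    top = np.argsort(-scores)[:k]       # highest similarity first
    return top.tolist(), scores[top].tolist()

# Toy catalog of four agents in a 3-dimensional embedding space.
agent_vecs = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
query_vec = np.array([1.0, 0.05, 0.0])
indices, scores = shortlist_agents(query_vec, agent_vecs, k=2)
print(indices)
```

The router then only has to disambiguate among the shortlisted agents, which is one plausible way a shortlist could narrow the retrieval gap as the catalog grows.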

arxiv arXiv cs.AI · 8d ago

Variability in AI-Generated Software: A New Product-Line Approach

An exploratory analysis of 10 vibe-coded C/C++ projects reveals near-zero in-artifact variability, with all decisions resolved at generation time. The paper proposes Variability by Regeneration (VbR), a product-line approach where an LLM acts as a derivation engine, generating tailored binaries from declarative specifications, with a variant dispatcher routing user requests to the correct binary. VbR shifts variability into specifications, not code, offering a new paradigm for SPL engineering.

arxiv arXiv cs.AI · 8d ago

Technical Taxonomy of LLM Agent Communication Protocols

A new taxonomy classifies LLM agent communication protocols across five dimensions: counterparty, payload, interaction state, discovery mechanism, and schema flexibility. Analysis shows hybrid payloads, session-state persistence, and runtime schema negotiation are common, with decentralized discovery remaining rare. The study predicts short-term convergence toward unified agent-to-agent and agent-to-context protocols, and long-term evolution toward a federated, layered protocol stack.

arxiv arXiv cs.AI · 8d ago

OrthoReg: Orthogonal Regularization for Hybrid Symbolic-Neural Dynamical Systems

OrthoReg introduces orthogonal regularization to prevent neural components from relearning symbolic structures in hybrid dynamical systems. By directly penalizing overlap between symbolic and neural parts, it enables a complementary decomposition where symbolic models capture expressible physics and neural components handle remaining dynamics. On benchmarks with partial library mismatch, OrthoReg improves symbolic recovery and out-of-distribution performance.
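
One simple way to picture an overlap penalty is the squared cosine similarity between the symbolic and neural outputs over a batch: it is zero when the two contributions are orthogonal and approaches one when the neural part duplicates the symbolic part. This is only a sketch under that assumption; the summary does not give OrthoReg's actual penalty, and `overlap_penalty` is a hypothetical name.

```python
import numpy as np

def overlap_penalty(symbolic_out, neural_out, eps=1e-12):
    """Squared cosine similarity between two output vectors (assumed penalty form)."""
    s = np.asarray(symbolic_out, dtype=float).ravel()
    n = np.asarray(neural_out, dtype=float).ravel()
    cos = (s @ n) / (np.linalg.norm(s) * np.linalg.norm(n) + eps)
    return float(cos ** 2)

symbolic = [1.0, 0.0, -1.0, 0.0]
print(overlap_penalty(symbolic, [0.0, 1.0, 0.0, -1.0]))  # orthogonal outputs
print(overlap_penalty(symbolic, [2.0, 0.0, -2.0, 0.0]))  # neural part copies symbolic part
```

Added to the training loss with a weight, a term like this would discourage the neural component from explaining dynamics the symbolic model already covers.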

media Don't Worry About the Vase · 8d ago

No Jailbreak: Fable's 'Fix This Code' Was a Fake Scenario

The article confirms there was no actual jailbreak of Anthropic's Fable AI. Instead, a test was run on fake code with planted vulnerabilities, in which Fable refused to review the code and responded to a request to 'fix this code' only after manual steps. Katie Moussouris of Luta Security states this scenario should not trigger export controls, calling it a deliberate, engineered test that undermines claims of a security breach.

arxiv arXiv cs.LG · 8d ago

ConTex: Global Counterfactual Generation for Time Series Forecasting

ConTex reformulates counterfactual generation for time series forecasting as a globally consistent intervention problem. It achieves state-of-the-art validity with sparse, interpretable interventions, reduces computational cost by 12-36x, and enables real-time inference in approximately 0.007 seconds.

arxiv arXiv cs.LG · 8d ago

Deep Reinforcement Learning for Minimum Zero-Forcing Sets

This paper proposes SD-ZFS, a deep reinforcement learning framework adapted from S2V-DQN, to solve the NP-hard minimum zero-forcing set problem on undirected graphs. The framework demonstrates strong performance compared to optimal solutions and greedy heuristics, showing effective generalization, scalability, and transfer across diverse graph structures.
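
For readers unfamiliar with the underlying problem, the sketch below applies the standard zero-forcing color-change rule (a colored vertex with exactly one uncolored neighbor forces that neighbor to become colored) and checks whether a starting set eventually colors the whole graph. It illustrates the problem definition only, not the SD-ZFS learning framework; the function names and toy graphs are illustrative.

```python
def forced_closure(adj, initial):
    """Apply the zero-forcing rule until no more vertices can be colored."""
    colored = set(initial)
    changed = True
    while changed:
        changed = False
        for v in list(colored):
            uncolored = [u for u in adj[v] if u not in colored]
            if len(uncolored) == 1:      # v forces its single uncolored neighbor
                colored.add(uncolored[0])
                changed = True
    return colored

def is_zero_forcing_set(adj, initial):
    """True if the initial set forces every vertex of the graph."""
    return forced_closure(adj, initial) == set(adj)

# Path 0-1-2-3 and a star with center 0 and leaves 1, 2, 3.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}

print(is_zero_forcing_set(path, {0}))     # an endpoint forces the whole path
print(is_zero_forcing_set(path, {1}))     # an interior vertex has two uncolored neighbors
print(is_zero_forcing_set(star, {1, 2}))  # two leaves suffice for this star
```

The minimum zero-forcing set problem asks for the smallest initial set that passes this check.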

arxiv arXiv cs.LG · 8d ago

LiL-Q: Convex Method for Nonlinear PDEs with PINNs

A new convex quasilinearization method, LiL-Q, solves nonlinear PDEs by reducing them to linear subproblems using physics-informed neural networks. LiL-Q converges in single-digit iterations across seven benchmarks, achieving machine precision when the exact solution lies in the trial space, and requires up to two orders of magnitude fewer parameters than standard PINN solvers.

arxiv arXiv cs.CL · 8d ago

Security and Privacy Prompts in User-LLM Conversations

A study of 14,727 security and privacy prompts from 3.2M real-world user-LLM conversations identifies nine categories of S&P queries. Commercial LLMs outperform open models, with GPT 5.5 providing good responses on 98% of prompts versus Llama 4 at 47%, though some commercial models produce contradictory responses across runs.

arxiv arXiv cs.LG · 8d ago

MGUP: Momentum-Gradient Alignment for Selective Optimization

MGUP introduces a selective update mechanism that applies larger step-sizes to a fixed proportion of parameters in stochastic optimization, while using smaller, non-zero step-sizes for the rest. It integrates seamlessly with optimizers like AdamW, Lion, and Muon, providing theoretical convergence guarantees for MGUP-AdamW and demonstrating superior or more stable performance in training large language models and MAE pretraining tasks.
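
A minimal sketch of the selective-update idea follows, layered on plain momentum SGD rather than AdamW, Lion, or Muon. It assumes, based on the title, that parameters are ranked by the elementwise product of momentum and gradient; the summary does not spell out the selection criterion, and `selective_step`, `tau`, `lr_high`, and `lr_low` are hypothetical names.

```python
import numpy as np

def selective_step(params, grad, momentum, tau=0.5,
                   lr_high=0.1, lr_low=0.01, beta=0.9):
    """One update: a fraction tau of parameters gets lr_high, the rest lr_low (non-zero)."""
    momentum = beta * momentum + (1 - beta) * grad
    alignment = momentum * grad                      # assumed alignment score
    k = max(1, int(round(tau * params.size)))        # fixed proportion of parameters
    selected = np.zeros(params.size, dtype=bool)
    selected[np.argsort(-alignment.ravel())[:k]] = True
    selected = selected.reshape(params.shape)
    step = np.where(selected, lr_high, lr_low)       # per-parameter step size
    return params - step * momentum, momentum, selected

params = np.zeros(4)
grad = np.array([1.0, -2.0, 0.5, 0.1])
new_params, new_momentum, selected = selective_step(params, grad, np.zeros(4))
print(selected)
print(new_params)
```

Because the unselected parameters still move with `lr_low`, no coordinate is frozen, which matches the summary's "smaller, non-zero step-sizes for the rest".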

arxiv arXiv cs.LG · 8d ago

SPHERE-JEPA: Family of Statistical Regularizers for Hypersphere

SPHERE-JEPA introduces deterministic statistical regularizers on the hypersphere, replacing stochastic sliced methods with analytically integrated objectives like MMD, KSD, and KL divergence. Rotationally invariant kernels based on heat and bandlimited filters ensure spatial bias-free learning, with empirical results showing improved convergence and performance on ImageNet and Galaxy10, and superior instance separation in procedural texture retrieval using KL divergence.

arxiv arXiv cs.LG · 8d ago

TUNEAHEAD Predicts Fine-tuning Performance Before Training

TUNEAHEAD is a lightweight framework that predicts fine-tuning performance using meta-feature vectors from dataset descriptors and short probe runs. It outperforms baselines like Early-Stop Extrapolation and ProxyLM, achieving an RMSE of 1.47 percentage points and 95.1% of predictions within ±3 percentage points of true scores on 370 held-out runs.
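
The two reported metrics can be computed as in the sketch below: root-mean-square error between predicted and actual scores, and the share of predictions within a tolerance of the true score. The numbers are made-up toy values, not data from the paper, and `prediction_metrics` is a hypothetical helper.

```python
import numpy as np

def prediction_metrics(predicted, actual, tolerance=3.0):
    """Return (RMSE, fraction of predictions within +/- tolerance), in percentage points."""
    err = np.asarray(predicted, dtype=float) - np.asarray(actual, dtype=float)
    rmse = float(np.sqrt(np.mean(err ** 2)))
    within = float(np.mean(np.abs(err) <= tolerance))
    return rmse, within

# Toy scores for four hypothetical fine-tuning runs.
rmse, within = prediction_metrics([70, 82, 91, 60], [71, 80, 95, 60])
print(round(rmse, 2), within)  # 2.29 0.75
```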

arxiv arXiv cs.LG · 8d ago

Confusion-Aware Transfer Teacher Curriculum Learning Framework

A confusion-aware difficulty score is introduced within the Transfer Teacher framework to improve model interpretability and data efficiency. Evaluations on CIFAR-10 show that confusion-aware curriculum ordering outperforms random ordering by up to 8.7% at 20% data, demonstrating consistent data-efficiency gains. However, curriculum or anti-curriculum ordering does not improve accuracy over standard training at full data, indicating that scoring function improvements alone are insufficient to overcome curriculum learning failure modes.

arxiv arXiv cs.LG · 8d ago

No-Free-Fairness: Fundamental Limits in Learning Systems

The paper introduces 'No-Free-Fairness' theorems that prove three fundamental limits in learning systems. These include inherent fairness-cost trade-offs, unavoidable subgroup disparity in finite samples, and model expressivity constraints that prevent fairness regardless of data. The results show fairness is constrained by problem structure, data limits, and model capacity, not just biased data.

arxiv arXiv cs.AI · 8d ago

Security and Privacy Prompts in User-LLM Conversations

A study of 14,727 security and privacy prompts from 3.2M real-world user-LLM conversations identifies nine categories of S&P questions. Thematic analysis and response testing show commercial LLMs outperform open models, with GPT 5.5 providing good responses on 98% of prompts versus Llama 4 at 47%, though some commercial models produce inconsistent responses across runs.

arxiv arXiv cs.AI · 8d ago

First Proof Second Batch: AI Tested on Research-Level Math Problems

A study evaluated several AI systems on ten research-level mathematics problems created by prominent mathematicians. The results include AI-generated solutions, human solutions, and referee reports, offering a detailed assessment of AI performance in solving advanced mathematical problems.

arxiv arXiv cs.CL · 8d ago

Can Language Models Discover Zero?

Language models of GPT-2 size cannot independently discover zero during testing, regardless of pretraining. However, performance improves significantly with training on tens to hundreds of zero examples, and language pretraining reduces required examples by about 50%.

arxiv arXiv cs.CL · 8d ago

Prompt Perturbation for Reliable LLM Evaluation

A new framework uses prompt perturbation to identify and filter structurally inconsistent pairwise comparisons in large language model evaluations. By incorporating graph-level consistency checks before ranking aggregation, the method reduces cyclic preferences and improves the reliability of LLM rankings.
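
To make "cyclic preferences" concrete, the sketch below finds three-way cycles (A beats B, B beats C, C beats A) in a set of pairwise judgments and drops the comparisons involved before any ranking is computed. It stands in for the graph-level consistency check only; the prompt-perturbation step and the paper's actual filtering rule are not described in the summary, and the function names are illustrative.

```python
from itertools import permutations

def find_preference_cycles(prefers):
    """prefers: set of (winner, loser) pairs. Returns the 3-model groups that form a cycle."""
    models = {m for pair in prefers for m in pair}
    cycles = set()
    for a, b, c in permutations(models, 3):
        if (a, b) in prefers and (b, c) in prefers and (c, a) in prefers:
            cycles.add(frozenset((a, b, c)))
    return cycles

def drop_cyclic_comparisons(prefers):
    """Keep only comparisons whose two models do not both sit inside a detected cycle."""
    cyclic = find_preference_cycles(prefers)
    return {(w, l) for (w, l) in prefers
            if not any(w in c and l in c for c in cyclic)}

prefers = {("A", "B"), ("B", "C"), ("C", "A"), ("A", "D"), ("B", "D")}
print(find_preference_cycles(prefers))
print(sorted(drop_cyclic_comparisons(prefers)))  # [('A', 'D'), ('B', 'D')]
```

The brute-force search over triples is fine for a handful of models; a larger comparison graph would call for a proper cycle-detection algorithm.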

arxiv arXiv cs.CL · 9d ago

A Framework for Evaluating Agentic Skills at Scale

We present a framework for evaluating agentic skills by constructing realistic tasks and assessing skill utility through task execution. Applied to 500 real-world skills, it generates 1,000 tasks and scoring rubrics, evaluating 19 agent-model configurations across proprietary and open-source models. Results show significant variation in instruction adherence and performance gains, with skills substantially altering model behavior compared to no-skill setups.

arxiv arXiv cs.CL · 9d ago

Non-negative Elastic Net Decoding for Information Retrieval

NNN decoding selects documents as a set that jointly reconstructs the query embedding via a sparse non-negative linear combination. It strictly extends dense retrieval by handling queries that dense retrieval fails on, especially in corpora with correlated documents, and achieves superior performance through end-to-end training of embeddings.
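
The reconstruction step can be sketched as a non-negative elastic-net problem: find weights w >= 0 that minimize 0.5 * ||q - D^T w||^2 + l1 * sum(w) + 0.5 * l2 * ||w||^2, and return the documents with non-zero weight. The version below solves it with projected gradient descent. The exact objective, solver, and hyperparameters used in the paper are not given in the summary, so the form above, the name `nonneg_elastic_net`, and the toy data are assumptions.

```python
import numpy as np

def nonneg_elastic_net(query, docs, l1=0.01, l2=0.01, iters=500):
    """Projected gradient descent for a non-negative elastic-net fit of query by doc rows."""
    w = np.zeros(docs.shape[0])
    lipschitz = np.linalg.norm(docs, 2) ** 2 + l2    # bound on gradient curvature
    for _ in range(iters):
        residual = docs.T @ w - query
        grad = docs @ residual + l1 + l2 * w         # l1 is linear because w >= 0
        w = np.maximum(0.0, w - grad / lipschitz)    # project onto w >= 0
    return w

# Three toy documents; the query lies between the first two.
docs = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
query = np.array([0.5, 0.5])
weights = nonneg_elastic_net(query, docs)
selected = np.nonzero(weights > 1e-6)[0].tolist()
print(selected)  # [0, 1]
```

Unlike top-k dense retrieval, which scores each document independently, the weights here are fit jointly, so a document that only repeats what another selected document already contributes tends to receive a small or zero weight.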