Research paper — korshunov.ai

PRIME: Evaluating Prompt Resolution in Conflicting Instructions

PRIME introduces a framework to analyze how large language models handle conflicting instructions by generating calibrated conflicts in response length, format, and reasoning. The study finds that conflict type has a greater impact on model behavior than model size, revealing diverse failure modes across conflict categories. Results highlight the need for conflict awareness and suggest instruction following cannot be reliably assessed through isolated benchmarks alone.

arxiv arXiv cs.AI · 17h ago

FACTOR Enables Adaptive Verification for Factuality in Long-Form Generation

FACTOR introduces an inference-time model that adapts verification criteria based on claim-level uncertainty. It improves factuality and reduces verification cost by dynamically allocating effort to high-risk claims, demonstrating effective and model-agnostic performance on the FactScore benchmark.

arxiv arXiv cs.AI · 17h ago

VADAOrchestra: Neurosymbolic Orchestration of Adaptive Reasoning Workflows

VADAOrchestra introduces a neurosymbolic framework that combines LLM-based workflow orchestration with Datalog+/- symbolic reasoning. It enables adaptive, explainable decision-making by incrementally planning workflows and executing logical inference on demand, offering verifiable traces, auditability, and scalability over large datasets.

media r/LocalLLaMA · 17h ago

My micro-benchmark: how good are LLMs at simulating wetting behaviour?

The author benchmarks LLMs in simulating wetting behaviour using Surface Evolver, a 1992 tool for modeling liquid surfaces. LLMs are evaluated objectively by comparing their generated datafiles against reference implementations, with results showing pass counts and token costs for each model.

arxiv arXiv cs.AI · 17h ago

SCOPE: Self-Adaptive Symbolic Planning for Open-Ended Environments

SCOPE introduces a framework that refines action plans and evolves symbolic world models in open-ended environments. It combines a Symbolic Execution Simulator and a Self-Adaptive Symbolic Memory to improve plan completeness, perturbation resilience, and cross-task adaptability.

arxiv arXiv cs.AI · 17h ago

LLM-Orchestrated Agent for SOI Directional Coupler Design

A large language model orchestrates the design of a silicon-on-insulator 2x2 directional coupler by proposing gap values and assessing convergence. The design is validated through eigenmode and FDTD simulations on a common 2D effective-index model, showing a consistent phase offset of 2.837(11) micrometers that is corrected in a closed-loop process. The final device achieves a 50/50 split with a cross fraction of 0.498, within 0.0017 of the target.

lab Microsoft Research Blog · 18h ago

Talos: Automated Genomic Reanalysis for Rare Disease Diagnosis

Talos is an open-source tool that automates iterative reanalysis of genomic data to identify rare disease diagnoses. It achieved a 90% recovery rate of in-scope diagnoses with only 1.3 candidate variants per patient, and delivered 241 new diagnoses across 5,000 undiagnosed patients, with most new findings emerging within 32 days of evidence publication.

arxiv arXiv cs.AI · 18h ago

Prompt-Side Preprocessing Enhances Edge AI Accuracy

A structured prompt framework improves local LLM accuracy in environmental monitoring by transforming raw sensor data into enriched textual representations. Evaluations on indoor and outdoor datasets show that enriched prompts raise local model accuracy from 50.9% to 81.7% indoors and from 63.7% to 79.3% outdoors, while keeping latency low at about 0.22 seconds in no-chain-of-thought mode.

arxiv arXiv cs.AI · 19h ago

Grounded Scaling: Determinism as a Core Limit in Agentic AI

Agentic AI performance degrades exponentially in non-deterministic environments, with k-step success falling as δ^k when per-step determinism δ < 1. The paper introduces a framework linking environment determinism to task success, verifiability, and skill evolution, proposing a Supply Certainty Index and a five-level Determinism Maturity Model. It challenges prevailing views by identifying determinism as a binding constraint across compute, data, embodiment, and alignment.
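The δ^k relationship above can be made concrete with a few lines of Python. This is a minimal sketch, not the paper's code: it assumes each step succeeds independently with the same probability δ, and the function names `k_step_success` and `max_horizon` are hypothetical.

```python
import math


def k_step_success(delta: float, k: int) -> float:
    """Probability that k consecutive steps all succeed.

    Assumes (for illustration) independent steps that each succeed
    with the same probability delta.
    """
    if not 0.0 <= delta <= 1.0:
        raise ValueError("delta must lie in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    return delta ** k


def max_horizon(delta: float, target: float) -> int:
    """Largest k for which delta**k is still at least `target`."""
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    if not 0.0 < target <= 1.0:
        raise ValueError("target must lie in (0, 1]")
    # Solve delta**k >= target for k, then correct any rounding error.
    k = math.floor(math.log(target) / math.log(delta))
    while delta ** (k + 1) >= target:
        k += 1
    while k > 0 and delta ** k < target:
        k -= 1
    return k


if __name__ == "__main__":
    for delta in (0.999, 0.99, 0.9):
        print(delta, k_step_success(delta, 50), max_horizon(delta, 0.5))
```

Under these assumptions, a per-step determinism of 0.99 keeps the success probability above one half for only 68 steps, which is why small per-step losses dominate long-horizon tasks.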

arxiv arXiv cs.AI · 19h ago

Fed-CausalDiff: Decoupled Synchronization for Federated Do-Simulation

Fed-CausalDiff introduces a federated causal diffusion framework that enables do-simulation in decentralized settings. It decomposes latent state evolution into global and local components, allowing decoupled synchronisation to reduce communication cost while maintaining accurate policy evaluation and ATE estimation.

arxiv arXiv cs.AI · 19h ago

Generative Robust Optimisation Framework

Generative Robust Optimisation (GRO) introduces a deep generative model to define uncertainty sets, capturing nonlinear correlations, asymmetry, and multimodality. A five-point evaluation framework assesses neural network-based uncertainty sets across reconstruction fidelity, distribution matching, latent regularity, robust relevance, and computational tractability, with experiments validating GRO's effectiveness in production planning and facility location.

arxiv arXiv cs.AI · 20h ago

Concept-Constrained Prompt Learning for Few-Shot CLIP Adaptation

CCPL introduces a lightweight framework that anchors class prompts to frozen concept prototypes, improving few-shot CLIP adaptation. It achieves better base-to-new performance than CoOp on DTD and EuroSAT, with consistent gains from text-space concept regularization, while showing no clear gain or loss on OxfordPets. The method uses concept dropout and controllable ensemble fusion at inference, with results sensitive to dataset semantics and protocol.

arxiv arXiv cs.AI · 20h ago

Context-Aware Distillation and Ablation for Text2DSL

A new Text2DSL system uses context-aware distillation with a structured context of BNF grammar, API specification, and closed identifier vocabulary. Ablation studies show that the vocabulary has the largest impact on semantic quality, while API and BNF significantly improve structural validity, confirming structured context as a critical, load-bearing component.

arxiv arXiv cs.AI · 20h ago

CWE-Level Generalisation in Syscall-Based HIDS

A one-class anomaly detector trained on normal behavior of CVEs sharing a CWE class can generalise to unseen CVEs within the same class, but effectiveness varies by CWE family. The CWE-307 detector achieves F1 = 0.6976 at 5% false positive rate, while CWE-89 and CWE-434 perform poorly, with F1 ≤ 0.21. Cross-CVE transfer is direction-dependent and driven more by the breadth of the source normal profile than the CWE category.
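To show what "F1 at a 5% false positive rate" means operationally, here is a minimal sketch in plain Python. It is not the paper's pipeline: the anomaly scores are made-up numbers, the function names are hypothetical, and it assumes higher scores mean "more anomalous" and that the threshold is calibrated on held-out normal traces.

```python
import math


def threshold_at_fpr(normal_scores, target_fpr):
    """Pick a threshold so that at most `target_fpr` of the normal
    scores lie strictly above it (those would be false positives)."""
    ranked = sorted(normal_scores, reverse=True)
    # Number of normal samples we are allowed to flag.
    allowed = math.floor(target_fpr * len(ranked) + 1e-9)
    if allowed >= len(ranked):
        return float("-inf")
    return ranked[allowed]


def f1_at_threshold(attack_scores, normal_scores, threshold):
    """F1 of the rule 'flag a trace when its score exceeds threshold'."""
    tp = sum(s > threshold for s in attack_scores)
    fn = len(attack_scores) - tp
    fp = sum(s > threshold for s in normal_scores)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    normal = list(range(20))          # 20 normal traces, scores 0..19
    attacks = [25, 30, 10, 19.5]      # 4 attack traces
    thr = threshold_at_fpr(normal, 0.05)
    print(thr, f1_at_threshold(attacks, normal, thr))
```

Fixing the false positive rate first and reporting F1 afterwards makes detectors for different CWE families comparable at the same alert budget.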

arxiv arXiv cs.AI · 21h ago

Text2DSL: LLM-Based Code Generation for Domain-Specific Languages

This paper introduces Text2DSL, a distinct task of generating domain-specific language code from natural language. Using the PolkitBench dataset of 4,204 validated pairs, it shows that structured context (such as BNF grammar and API specs) boosts syntactic and structural validity and CodeBLEU scores by 60% to 95% across different LLMs, without fine-tuning.

media r/LocalLLaMA · 21h ago

Baidu's Unlimited-OCR Transcribes Dozens of Pages in One Forward Pass

Baidu has released Unlimited-OCR, a model that transcribes dozens of pages in a single forward pass using Reference Sliding Window Attention (R-SWA). It builds on DeepSeek-OCR, inheriting its encoder, image compression, and MoE architecture, with only 500M active parameters per token. The model reports 93.92% accuracy on OmniDocBench v1.6, versus DeepSeek-OCR's 87.01% on v1.5; the two scores come from different benchmark versions and are vendor-reported, so they warrant independent validation.

arxiv arXiv cs.AI · 21h ago

PaperClaw: Autonomous Research with Human-in-the-Loop Refinement

PaperClaw is a multi-agent system that autonomously conducts research from field selection to paper publication. It uses a validated, iterative propose-test-reflect loop, grounded in real references and runnable results, and supports human-in-the-loop refinement at any stage. Evaluation shows it produces strong papers both autonomously and with human oversight.

arxiv arXiv cs.LG · 21h ago

Optimal subsampling in RKHS for supervised learning

This paper proposes an optimal subsampling scheme in reproducing kernel Hilbert spaces, based on asymptotic analysis of an empirical risk minimizer with Horvitz-Thompson reweighting. The scheme, derived via the trace of the covariance operator, is shown to be implementable via plug-in and performs well on synthetic and real-world datasets.
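For readers unfamiliar with Horvitz-Thompson reweighting, the generic form of a reweighted empirical risk minimizer in an RKHS is sketched below. The notation is an assumption for illustration and may differ from the paper's: $\delta_i \in \{0,1\}$ indicates whether point $i$ was sampled, $\pi_i$ is its inclusion probability, $\ell$ is the loss, and the ridge penalty with parameter $\lambda$ is a common choice rather than something the summary confirms.

```latex
\hat{f}_{\pi} \;=\; \operatorname*{arg\,min}_{f \in \mathcal{H}}
  \; \frac{1}{n} \sum_{i=1}^{n} \frac{\delta_i}{\pi_i}\,
  \ell\bigl(y_i, f(x_i)\bigr) \;+\; \lambda \,\lVert f \rVert_{\mathcal{H}}^{2},
\qquad
\mathbb{E}\!\left[\frac{\delta_i}{\pi_i}\right] = 1 .
```

Because each weight $\delta_i/\pi_i$ has expectation one, the subsampled risk is an unbiased estimate of the full-sample risk; the choice of the $\pi_i$ then only affects the variance, which is what an optimal subsampling scheme targets.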

arxiv arXiv cs.LG · 21h ago

TeaNet Improves Few-Shot Learning in Vibrational Spectroscopy

TeaNet, a task-enhanced augmentation network, reconstructs randomly masked spectra to generate augmented samples that preserve original spectral features while introducing domain-specific variations. This approach enables deep neural networks to identify discriminant wavenumbers more effectively, outperforming CNNs by 17% in challenging synthetic scenarios and offering improved interpretability in few-shot learning tasks.

arxiv arXiv cs.LG · 21h ago

Topological Neural Dynamics: Neuron-wise Sequence Modeling

Topological Neural Dynamics (TND) introduces a neuron-wise framework for sequence modeling, where each neuron evolves independently through a directed graph structure. In a single-player Pong behavior cloning task, TND achieves a mean of 17.47 consecutive catches per round, more than three times the score of any baseline model.