All articles — korshunov.ai

Clinical Reasoning Graphs: Structured Evaluation of LLM Diagnostic Reasoning Reveals Competence Without Consistency

This study introduces clinical reasoning graphs to evaluate the diagnostic reasoning patterns of large language models, revealing that while they achieve competence, they lack consistent reasoning schemas. The authors extracted structured graph representations from 750 traces across five LLMs and tested for stable reasoning patterns in clinically similar cases.

arXiv cs.CL · 4h ago

SABER-Math: Automated Benchmark for Information Retrieval Evaluation in Mathematics

Researchers introduce SABER-Math, the first fully automated benchmark for evaluating mathematical information retrieval without expert annotation, addressing the difficulty of isolating retriever effects on downstream performance.

arXiv cs.CL · 4h ago

MemDelta: Controlled Baselines and Hidden Confounds in Agent Memory Evaluation

The article introduces MemDelta, a controlled evaluation protocol for agent memory systems that isolates individual components to prevent confounding variables from skewing results. Using the LongMemEval-S dataset with 500 questions across three model families, the study reveals that reported gains often mix changes in memory methods with variations in language models or retrieval pipelines.

arXiv cs.CL · 4h ago

Can LLM-as-a-Judge Reliably Verify Rubrics in Agentic Scenarios?

This study investigates the reliability of using Large Language Models as judges for verifying rubrics in complex agentic scenarios, introducing RuVerBench as the first benchmark for this purpose. The research evaluates frontier models on deep research and coding tasks, revealing that while performance is strong, significant noise persists in verification.

arXiv cs.CL · 4h ago

Towards Physical Intuitions for Alignment Dynamics: A Case Study With Randomness Crystallization

This article proposes using thermodynamic phase-transition theory to understand the dynamics of language model alignment during post-training, specifically through the lens of material crystallization. The authors argue that this physical framework provides a principled vocabulary for reasoning about how models change and where alignment-induced structure originates.

arXiv cs.CL · 4h ago

ParametricSkills: Converting Textual Skills into LoRA Adapters

The authors propose ParametricSkills, a framework that converts free-form textual skills into parameters at test time by training a hypernetwork to generate LoRA adapters. This approach enables context-free skill exploitation, addressing the difficulty of adhering to instructions in complex scenarios.

arXiv cs.CL · 4h ago

Little Brains, Big Feats: Exploring Compact Language Models

This study investigates the performance of small language models during the generation stage within a Retrieval-Augmented Generation (RAG) system. The research benchmarks these models using diverse open-source and proprietary datasets to evaluate their effectiveness across various subject areas.

github llama.cpp · 4h ago

llama.cpp b9846 release with Vulkan matmul optimization for Asahi Linux

The llama.cpp project has released version b9846, which includes a Vulkan backend optimization for Asahi Linux. This update rolls back the block size loop in matrix multiplication to improve compatibility and performance on Apple Silicon hardware running Linux.

arXiv cs.CL · 5h ago

LatentRevise: Learning from Zero-Hit Reasoning

The paper introduces LatentRevise, a first-order latent revision method designed to recover training signals in reinforcement learning with verifiable rewards (RLVR) for prompts where correct trajectories are rarely sampled. By optimizing the input embeddings of a reasoning prefix based on failed rollouts and gold answers, the method generates useful data from previously unproductive attempts.

arXiv cs.CL · 5h ago

Know Before You Fetch: Calibrated Retrieval-Budget Allocation for Retrieval-Augmented Generation

This article introduces an adaptive RAG framework that allocates retrieval budgets by calibrating sequence log-probability and prefix-logit uncertainty signals into probabilities of correctness. The system decides whether to answer closed-book, retrieve a compact context (k=1), retrieve a full context (k=5), or abstain based on these calibrated probabilities.
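One way to picture the four-way decision described above is a simple threshold rule over the calibrated probabilities. The sketch below is illustrative only: the action names, the cheapest-first ordering, and the 0.7 threshold are assumptions made for this example and are not taken from the paper.

```python
from typing import Dict

# Hypothetical action names, ordered from cheapest to most expensive.
ACTIONS = ("closed_book", "retrieve_k1", "retrieve_k5")


def choose_action(p_correct: Dict[str, float], threshold: float = 0.7) -> str:
    """Pick the cheapest action whose calibrated probability of
    correctness meets the threshold; abstain if none does.

    The threshold value and the cheapest-first rule are assumptions
    for illustration, not the paper's actual policy.
    """
    for action in ACTIONS:
        if p_correct.get(action, 0.0) >= threshold:
            return action
    return "abstain"


if __name__ == "__main__":
    calibrated = {"closed_book": 0.42, "retrieve_k1": 0.75, "retrieve_k5": 0.88}
    print(choose_action(calibrated))
```

With these example inputs the closed-book estimate falls below the threshold, so the rule selects the compact k=1 retrieval rather than paying for the full k=5 context.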

arXiv cs.CL · 5h ago

IHDec: Divergence-Steered Contrastive Decoding for Securing Multi-Turn Instruction Hierarchies

IHDec addresses the failure of Large Language Models to maintain instruction hierarchies in multi-turn contexts by leveraging Jensen-Shannon Divergence to detect and correct role-influence inversions. This training-free method dynamically suppresses subordinate roles that override superior directives during token generation.
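For readers unfamiliar with the measure named above, the following is a minimal sketch of Jensen-Shannon divergence between two discrete distributions, such as two next-token distributions. It shows only the standard definition; which distributions IHDec compares and how the result steers decoding are not specified in this summary, so the function names and example inputs here are assumptions.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Kullback-Leibler divergence KL(p || q) in bits.

    Terms with p_i == 0 contribute nothing by convention.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence: the average KL divergence of p and q
    from their midpoint mixture m. With log base 2 it lies in [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


if __name__ == "__main__":
    # Two hypothetical next-token distributions over a 3-token vocabulary.
    print(js_divergence([0.7, 0.2, 0.1], [0.1, 0.2, 0.7]))
```

The divergence is symmetric and bounded, which makes it a convenient signal for comparing how strongly two conditioning contexts disagree about the next token.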

arXiv cs.CL · 5h ago

Are We Measuring Strategy or Phrasing? The Gap Between Surface- and Approach-Level Diversity in LLM Math Reasoning

This study introduces approach-level diversity to address the gap between surface-level variation and actual strategic differences in LLM mathematical reasoning. It demonstrates that prior metrics fail to capture true methodological diversity, and reports that approach-level diversity declines during diversity-aware RLVR training.

arXiv cs.CL · 5h ago

VISTA: A Proprioceptive Dashboard for LLM Context Management

The article introduces VISTA, a training-free layer designed to address the context window limitations of long-horizon tool agents by exposing their internal state. It argues that frontier models are blind to their own context usage and proposes an interface that surfaces working memory details rather than relying on learned compression policies.

arXiv cs.CL · 5h ago

Node-to-Neighborhood Semantic Consistency: Text-Topology Alignment for TAGs Anomaly Detection

This paper addresses graph anomaly detection on text-attributed graphs by formalizing it as a node-to-neighborhood semantic consistency problem, where anomalies stem from mismatches between textual semantics and topological relationships. The authors propose N2NSC, a framework that uses two complementary fusion paths to align graph topology with textual semantics, enabling large language models to leverage both structural and textual neighborhood information.

arXiv cs.CL · 5h ago

SHOVIR: A Benchmark for Evaluating Vision Shortcut Learning in Radiology Report Generation

The SHOVIR benchmark evaluates vision shortcut learning in radiology report generation by extending MIMIC-CXR and PadChest-GR with per-box CheXpert labels. It utilizes image-level and disease-level occlusion experiments to isolate direct and contextual shortcuts where models rely on spurious correlations rather than actual visual evidence.

github llama.cpp · 5h ago

llama.cpp b9844 release adds NVFP4 support and new binaries

The llama.cpp project has released version b9844, which introduces ggml-webgpu support for the NVFP4 quantization format. This update also provides pre-built binaries for macOS, iOS, Linux, Android, Windows, and openEuler across various hardware backends.

arXiv cs.CL · 6h ago

Not-quite-human tastes: the stylized omnivorousness of LLM survey surrogates

This study evaluates the ability of large language models to approximate human cultural tastes by generating silicon surrogates from the Survey of Public Participation in the Arts. Using models from OpenAI, Anthropic, and DeepSeek, the authors analyze 277,470 synthetic respondents to determine if LLMs can faithfully replicate real-world survey data.

arXiv cs.CL · 6h ago

Efficient Retrieval-Augmented Generation via Token Co-occurrence Graphs

Researchers propose TIGRAG (Token-Induced GraphRAG), a framework that uses token co-occurrence statistics to build scalable knowledge graphs for efficient retrieval-augmented generation. This approach addresses the limitations of standard RAG in multi-hop reasoning by avoiding expensive LLM-based extraction pipelines.

arXiv cs.CL · 6h ago

Information Dynamics of Language Communication

Researchers introduce an information-theoretic framework to quantify the directed flow of semantic content between interlocutors and decompose multi-source contributions into redundant, unique, and synergistic components.

arXiv cs.CL · 6h ago

Does Verbose Chain-of-Thought Really Help? In-Distribution Evidence that Content, Not Length, Matters

This study investigates whether verbose chain-of-thought prompting improves large language model reasoning through increased computation or by providing useful semantic content. The authors present evidence from in-distribution sampling and controlled interventions to determine the specific factors driving performance gains.