All articles — korshunov.ai


Systematic Benchmark of Lightweight Hallucination Detection Across QA, Dialogue, and Summarisation

This paper benchmarks five lightweight, CPU-feasible hallucination detection methods to provide practical alternatives for resource-constrained researchers who cannot use GPU-intensive or proprietary solutions. The study evaluates ROUGE-L, semantic similarity, BERTScore, a FEVER-trained DeBERTa NLI detector, and an ensemble of similarity and NLI across the HaluEval benchmark's question answering, dialogue, and summarisation tasks.
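
As a rough illustration of the cheapest method in that list, ROUGE-L can be computed from the longest common subsequence (LCS) of tokens and thresholded to flag answers that overlap little with their source. This is only a sketch: the whitespace tokenisation, the `flag_hallucination` helper, and the 0.3 threshold are assumptions for illustration, not settings taken from the paper.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 between two strings, using naive whitespace tokens."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def flag_hallucination(answer, source, threshold=0.3):
    """Hypothetical decision rule: low overlap with the source is suspicious."""
    return rouge_l_f1(answer, source) < threshold


if __name__ == "__main__":
    source = "the capital of france is paris"
    print(flag_hallucination("paris is in germany", source))
```

A lexical score like this runs on any CPU but cannot tell a paraphrase from a fabrication, which is presumably why the benchmark also includes semantic and NLI-based detectors.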

arXiv cs.CL · 6h ago

SrDetection: A Self-Referential Framework for Data Leakage Detection in Code LLMs

The authors introduce SrDetection, a unified framework for detecting data leakage in code large language models that operates in both gray-box and black-box settings. The method generates semantically equivalent variants of benchmark samples to identify cases where the original data is disproportionately easier for the model due to pre-training exposure.

arXiv cs.CL · 6h ago

Neural Procedural Memory: Empowering LLM Agents with Implicit Activation Steering

The paper introduces Neural Procedural Memory (NPM), a training-free framework that lets Large Language Model agents store procedural memory as implicit activation steering instead of relying on explicit textual instructions. By distilling skills from historical experiences into steering vectors, NPM directly activates task-relevant neural mechanisms to guide execution.
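
The general idea of activation steering can be sketched in a few lines: derive a direction in hidden-state space and add it to the activations at inference time. The difference-of-means construction, the function names, and the `strength` parameter below are assumptions chosen for illustration; the summary does not say how NPM actually distils its vectors.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_vector(skilled_acts, baseline_acts):
    """Assumed construction: mean activation on skilled runs minus baseline runs."""
    skilled = mean_vector(skilled_acts)
    baseline = mean_vector(baseline_acts)
    return [s - b for s, b in zip(skilled, baseline)]


def steer(hidden, vector, strength=1.0):
    """Shift a hidden state along the steering direction."""
    return [h + strength * v for h, v in zip(hidden, vector)]


if __name__ == "__main__":
    # Toy 2-dimensional activations standing in for real hidden states.
    vec = steering_vector([[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [2.0, 2.0]])
    print(steer([0.5, 0.5], vec, strength=2.0))
```

In a real model the same addition would be applied to a chosen layer's hidden states during the forward pass, with no gradient updates, which is what makes such approaches training-free.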

arXiv cs.CL · 6h ago

Revealing the Technology Development of Natural Language Processing: A Scientific Entity-Centric Perspective

This study analyzes the development of technologies in Natural Language Processing (NLP) from an entity-centric perspective, extracting methods, datasets, metrics, and tools to measure their impact via co-occurrence networks. The research reveals that while the Transformer architecture and pre-trained language models such as BERT have become mainstream, the average number of entities per paper is increasing, indicating a growing knowledge burden for researchers.

arXiv cs.CL · 6h ago

MATCH: Modulating Attention via In-Context Retrieval for Long-Context Transformers

The authors propose MATCH, a framework that augments sparsified attention mechanisms with dynamically integrated in-context information to address the scalability bottlenecks of traditional attention in long-context scenarios.

arXiv cs.CL · 6h ago

Smooth Scaling Laws Hide Stepwise Token Learning

This study presents a token-level framework that decomposes language model scaling laws into localized learning events of individual contextualized tokens, challenging the view that heavy-tailed pattern difficulty is the sole cause of smooth scaling.

arXiv cs.CL · 6h ago

Exploring Motivations for Algorithm Mention in NLP: A Deep Learning Approach

This study proposes a sentence-level framework to identify, analyze, and trace the evolution of motivations for mentioning algorithms in academic papers, using natural language processing as a case study. The researchers classify these motivations using pretrained models and data augmentation, revealing that deep learning models outperform traditional machine learning approaches.

arXiv cs.CL · 6h ago

KbSD: Knowledge Boundary aware Self-Distillation for Behavioral Calibration

The authors propose KbSD, a framework that addresses reward sparsity in agentic search by using dense token-level supervision and quadrant-adaptive optimization to calibrate when models should trust parametric memory versus retrieved evidence. This approach utilizes an information-asymmetric self-distillation process where a hint-augmented teacher generates calibrated reasoning demonstrations for a student model without requiring a larger external model.

arXiv cs.CL · 6h ago

ARKD: Adaptive Reinforcement Learning-Guided Bidirectional KL Divergence Distillation for Text Generation

The authors propose ARKD, a reinforcement-learning-based adaptive KL-weighted distillation framework that addresses the limitations of single KL objective methods in compressing Large Language Models. By using a policy network to dynamically assign weights to forward and reverse KL divergence based on teacher-student distributional characteristics, the method achieves dual alignment on principal and long-tail modes.
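
To make the forward/reverse distinction concrete, here is a minimal sketch of a weighted bidirectional KL objective over two discrete distributions. The fixed weight `alpha` stands in for the value a policy network would produce per example; that substitution, the function names, and the `eps` smoothing are assumptions, not details from the paper.

```python
import math


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def mixed_kl(teacher, student, alpha):
    """alpha * forward KL + (1 - alpha) * reverse KL.

    Forward KL(teacher || student) penalises the student for missing teacher
    mass (mode covering); reverse KL(student || teacher) penalises the student
    for placing mass where the teacher has little (mode seeking).
    """
    forward = kl(teacher, student)
    reverse = kl(student, teacher)
    return alpha * forward + (1 - alpha) * reverse


if __name__ == "__main__":
    teacher = [0.9, 0.1]
    student = [0.5, 0.5]
    for alpha in (0.0, 0.5, 1.0):
        print(alpha, round(mixed_kl(teacher, student, alpha), 4))
```

Because the two divergences generally differ for the same pair of distributions, the choice of weight changes which mismatches the student is pushed to fix first.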

arXiv cs.CL · 6h ago

Timesteps of Mamba Align with Human Reading Times

A study demonstrates that the per-word processing time in the state-space language model Mamba aligns with human reading times. The research shows that Mamba's dynamic discretization timestep is a significant predictor of how long humans take to read words, even when controlling for other factors like GPT-2 surprisal.

arXiv cs.CL · 7h ago

Novelty Evolution in Chinese Library and Information Science Research

This study analyzes the distribution of novelty in Chinese Library and Information Science (LIS) papers published between 2000 and 2022, examining trends across journals, topics, and time periods. Using BERTopic for topic identification and combinatorial innovation theory for novelty scoring, the research investigates how collaboration patterns influence scholarly innovation.

arXiv cs.CL · 7h ago

Clinical Reasoning Graphs: Structured Evaluation of LLM Diagnostic Reasoning Reveals Competence Without Consistency

This study introduces clinical reasoning graphs to evaluate the diagnostic reasoning patterns of large language models, revealing that while they achieve competence, they lack consistent reasoning schemas. The authors extracted structured graph representations from 750 traces across five LLMs and tested for stable reasoning patterns in clinically similar cases.

arXiv cs.CL · 7h ago

SABER-Math: Automated Benchmark for Information Retrieval Evaluation in Mathematics

Researchers introduce SABER-Math, the first fully automated benchmark for evaluating mathematical information retrieval without expert annotation, addressing the difficulty of isolating retriever effects on downstream performance.

arXiv cs.CL · 7h ago

MemDelta: Controlled Baselines and Hidden Confounds in Agent Memory Evaluation

The article introduces MemDelta, a controlled evaluation protocol for agent memory systems that isolates individual components to prevent confounding variables from skewing results. Using the LongMemEval-S dataset with 500 questions across three model families, the study reveals that reported gains often mix changes in memory methods with variations in language models or retrieval pipelines.

arXiv cs.CL · 7h ago

Can LLM-as-a-Judge Reliably Verify Rubrics in Agentic Scenarios?

This study investigates the reliability of using Large Language Models as judges for verifying rubrics in complex agentic scenarios, introducing RuVerBench as the first benchmark for this purpose. The research evaluates frontier models on deep research and coding tasks, revealing that while performance is strong, significant noise persists in verification.

arXiv cs.CL · 7h ago

Towards Physical Intuitions for Alignment Dynamics: A Case Study With Randomness Crystallization

This article proposes using thermodynamic phase-transition theory to understand the dynamics of language model alignment during post-training, specifically through the lens of material crystallization. The authors argue that this physical framework provides a principled vocabulary for reasoning about how models change and where alignment-induced structure originates.

arXiv cs.CL · 7h ago

ParametricSkills: Converting Textual Skills into LoRA Adapters

The authors propose ParametricSkills, a framework that converts free-form textual skills into parameters at test time by training a hypernetwork to generate LoRA adapters. This approach enables context-free skill exploitation, addressing the difficulty of adhering to instructions in complex scenarios.
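
For readers unfamiliar with LoRA, the adapter itself is just a low-rank update added to a frozen weight matrix: W' = W + scale * (B A). The sketch below shows only that merge step in plain Python; the hypernetwork that would generate `lora_b` and `lora_a` from a textual skill is not modelled, and the function names and toy matrices are assumptions for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
        for row in a
    ]


def apply_lora(weight, lora_b, lora_a, scale=1.0):
    """Return weight + scale * (lora_b @ lora_a).

    weight is d x k, lora_b is d x r, lora_a is r x k, with rank r much
    smaller than d and k in practice.
    """
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


if __name__ == "__main__":
    weight = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight (2 x 2)
    lora_b = [[1.0], [2.0]]            # 2 x 1, rank r = 1
    lora_a = [[3.0, 4.0]]              # 1 x 2
    print(apply_lora(weight, lora_b, lora_a))
```

Since only the two small factors vary per skill, a generator has far fewer values to produce than it would for a full weight matrix.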

arXiv cs.CL · 7h ago

Little Brains, Big Feats: Exploring Compact Language Models

This study investigates the performance of small language models during the generation stage within a Retrieval-Augmented Generation (RAG) system. The research benchmarks these models using diverse open-source and proprietary datasets to evaluate their effectiveness across various subject areas.

github llama.cpp · 7h ago

llama.cpp b9846 release with Vulkan matmul optimization for Asahi Linux

The llama.cpp project has released version b9846, which includes a Vulkan backend optimization for Asahi Linux. This update rolls back the block size loop in matrix multiplication to improve compatibility and performance on Apple Silicon hardware running Linux.

arXiv cs.CL · 8h ago

LatentRevise: Learning from Zero-Hit Reasoning

The paper introduces LatentRevise, a first-order latent revision method designed to recover training signals in reinforcement learning with verifiable rewards (RLVR) for prompts where correct trajectories are rarely sampled. By optimizing the input embeddings of a reasoning prefix based on failed rollouts and gold answers, the method generates useful data from previously unproductive attempts.