Research paper — korshunov.ai


Cliff Tokens: Identifying Single-Token Failure Triggers in LLM Mathematical Reasoning

Researchers introduce the concept of cliff tokens to identify specific single-token failure triggers in large language models during mathematical reasoning tasks. Unlike prior work that analyzes failures at step or sentence levels, this method pinpoints the exact token where potential drops significantly using an adaptive threshold based on a z-test. The study evaluates seven models across three benchmarks: GSM1K, MATH500, and AIME 2025. Deleting the first cliff token and resampling allows recovery of pass@64 to 1.0, whereas keeping it limits recovery to between 0.71 and 1.00. The authors propose a taxonomy classifying cliffs as deterministic, uncertain, or sampled-off based on greedy choice and token entropy. This classification generalizes across different model scales and exhibits distinct probabilistic characteristics for each type. Furthermore, the team validates this taxonomy through single-token preference optimization known as Cliff-DPO. Trained on GSM8K, Cliff-DPO improves accuracy by up to +6.6 across benchmarks. Optimization proves effective for uncertain and sampled-off cliffs but yields no improvement for deterministic ones.
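To make the z-test idea concrete, here is a minimal sketch, not the authors' procedure. It assumes the potential at each token is estimated as the fraction of correct rollouts sampled from the prefix ending at that token, and it flags the first position where a two-proportion z statistic exceeds a critical value. The function names, the rollout counts, and the default `z_crit` are all hypothetical.

```python
import math


def drop_z_score(succ_before, n_before, succ_after, n_after):
    """Two-proportion z statistic for a drop in estimated success rate."""
    p_before = succ_before / n_before
    p_after = succ_after / n_after
    pooled = (succ_before + succ_after) / (n_before + n_after)
    var = pooled * (1 - pooled) * (1 / n_before + 1 / n_after)
    if var == 0:
        return 0.0  # both rates are exactly 0 or 1: no evidence of a drop
    return (p_before - p_after) / math.sqrt(var)


def first_cliff(successes, n_rollouts, z_crit=2.33):
    """Return the index of the first token with a significant drop, else None.

    successes[i] is the (assumed) number of correct rollouts, out of
    n_rollouts, sampled from the prefix that ends at token i.
    """
    for i in range(1, len(successes)):
        z = drop_z_score(successes[i - 1], n_rollouts, successes[i], n_rollouts)
        if z > z_crit:
            return i
    return None


# Hypothetical counts: the success rate collapses at token index 3.
cliff_index = first_cliff([60, 58, 59, 20, 18], n_rollouts=64)
```

Because the statistic is scaled by the pooled variance, the raw drop needed to trigger a flag shrinks as more rollouts are sampled, which is one way a threshold can be adaptive.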

arxiv arXiv cs.CL · 5h ago

SFL-MTSC: Leveraging Semantic Frame-Level Multi-Task Self-Consistency for Robust Multi-Intent Spoken Language Understanding

Prompt-based spoken language understanding with large language models often suffers from inconsistent intent-slot structures due to decoding stochasticity, particularly in multi-intent scenarios. To address this, researchers propose Semantic Frame-Level Multi-Task Self-Consistency (SFL-MTSC), a novel structured aggregation framework operating at the semantic frame level. Instead of relying on output-level majority voting, SFL-MTSC decomposes predictions into intent-specific frames and applies domain-intent grouping alongside slot-level clustering. The framework evaluates cluster reliability using path support scoring to determine which frames are trustworthy. Reliable frames are retained and re-integrated to form the final prediction, ensuring greater structural consistency. Zero-shot experiments on the MAC-SLU benchmark dataset demonstrate improved slot F1 scores and overall accuracy compared to single-path inference. Intent accuracy remains largely stable across most settings while achieving these gains in slot-level performance.
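The aggregation idea can be illustrated with a rough sketch. It assumes each sampled path yields a list of `(intent, slots)` frames, scores a frame group by the share of paths that contain its intent, and resolves slot values by majority vote. The paper's actual path support scoring and slot clustering are not described here, so `aggregate_frames`, `min_support`, and the example utterance data are hypothetical.

```python
from collections import Counter, defaultdict


def aggregate_frames(paths, min_support=0.5):
    """Merge frames from several sampled paths into one prediction.

    paths: list of predictions, each a list of (intent, {slot: value}) frames.
    """
    # Group slot dictionaries by intent across all paths.
    by_intent = defaultdict(list)
    for frames in paths:
        for intent, slots in frames:
            by_intent[intent].append(slots)

    result = []
    for intent, slot_dicts in by_intent.items():
        # Keep an intent only if enough paths support it.
        if len(slot_dicts) / len(paths) < min_support:
            continue
        # Majority vote per slot name among the supporting frames.
        votes = defaultdict(Counter)
        for slots in slot_dicts:
            for name, value in slots.items():
                votes[name][value] += 1
        merged = {name: c.most_common(1)[0][0] for name, c in votes.items()}
        result.append((intent, merged))
    return result


paths = [
    [("play_music", {"artist": "Adele"}), ("set_volume", {"level": "5"})],
    [("play_music", {"artist": "Adele"})],
    [("play_music", {"artist": "Adel"}), ("navigate", {"dest": "home"})],
]
final = aggregate_frames(paths)
```

Voting per frame rather than per whole output means one noisy frame in a path does not discard the path's other, well-supported frames.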

arxiv arXiv cs.CL · 6h ago

Riazi-8B: An Urdu Large Language Model for Mathematical Reasoning

Recent large language models demonstrate strong mathematical reasoning, but these gains rely heavily on English-centric resources, leaving low-resource languages like Urdu with limited capabilities. To address this gap, researchers developed Riazi-8B, an Urdu model designed specifically for multi-step mathematical problem solving. The model was created through a two-step adaptation process involving continued pre-training on Urdu Wikipedia and supervised fine-tuning on Urdu Chain-of-Thought data derived from GSM8K. Evaluation of Riazi-8B was conducted on the MGSM-Urdu benchmark against existing Urdu instruction-tuned models. The results showed consistent improvements in answer correctness, reasoning quality, response completeness, and Urdu generation compared to baselines. These findings demonstrate that combining Urdu language adaptation with reasoning-focused fine-tuning effectively extends mathematical reasoning capabilities to low-resource languages.

arxiv arXiv cs.CL · 6h ago

Constraint Tax in Open-Weight LLMs: Tool Calling Suppression Under Structured Output Constraints

This study identifies a phenomenon called Tool Suppression, where open-weight language models cease invoking tools when JSON Schema constraints are simultaneously enabled. The authors observed this behavior in a production Agent system and reproduced it through controlled experiments across multiple model families. While tool execution and schema compliance function correctly when evaluated independently, they fail under joint deployment conditions. Analysis reveals that JSON Schema constraints are compiled into grammar-based token masks, rendering tool-call tokens unreachable during decoding. To interpret these findings, the paper proposes the Constraint Priority Inversion hypothesis, suggesting schema satisfaction dominates action selection under simultaneous constraints. The authors mitigate this issue by introducing Transparent Two-Pass Execution, an inference-time strategy that decouples tool execution from response generation. This approach restores tool invocation while preserving structured output guarantees without requiring model retraining. The research highlights that evaluating capabilities separately may overlook critical reliability issues in production systems.
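The masking mechanism can be shown with a toy example. This is an illustrative sketch rather than any serving engine's implementation: the three-token vocabulary, the `<tool_call>` token name, and the helper functions are hypothetical. Once a grammar mask sets disallowed logits to negative infinity, a tool-call token cannot be selected no matter how strongly the model prefers it.

```python
import math


def apply_mask(logits, allowed):
    """Set every token the grammar does not allow to -inf."""
    return {
        tok: (score if tok in allowed else -math.inf)
        for tok, score in logits.items()
    }


def greedy(logits):
    """Pick the highest-scoring token."""
    return max(logits, key=logits.get)


# Hypothetical next-token logits: the model strongly prefers a tool call.
logits = {"<tool_call>": 5.0, "{": 2.0, "Hello": 1.0}

# Hypothetical mask compiled from a JSON Schema: output must open an object.
allowed_by_schema = {"{"}

unconstrained = greedy(logits)
constrained = greedy(apply_mask(logits, allowed_by_schema))
```

Here `unconstrained` is the tool-call token while `constrained` is `{`, which mirrors why separating the tool-calling step from the schema-constrained generation step can restore tool use.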

arxiv arXiv cs.CL · 6h ago

REVERIEMEM: Perspective-Bounded Memory for Book-Based Role-Playing Agents

Recent large language model role-playing systems often fail in long-narrative contexts due to factual overreach and stylistic monotony. Factual overreach occurs when characters access information outside their narrative perspective, while stylistic monotony flattens character voices through static profile descriptions. To address these issues, the authors propose REVERIEMEM, a three-layer memory architecture designed for book-based character agents. This system utilizes an episodic layer for first-person scene memories, a semantic layer for visibility-tagged facts, and a personality layer for situation-dependent behavioral patterns. The researchers also introduce KBF-QA, a benchmark consisting of 4,386 questions across eight novels to test knowledge boundaries. Experimental results show that REVERIEMEM improves Knowledge Boundary Fidelity by 34.6 percentage points compared to prior methods. Additionally, the model achieves approximately a 79% win rate on BOOKWORLD's five-dimension pairwise narrative protocol. These findings suggest that perspective-bounded memory effectively enhances both factual accuracy and character-grounded narrative generation.

arxiv arXiv cs.CL · 6h ago

Framework Evaluates When GraphRAG and Agentic RAG Are Needed

The authors introduce a framework for evaluating and comparing regular, GraphRAG, Modular, and Agentic Retrieval-Augmented Generation (RAG) on semi-structured knowledge bases. They implement nine standardized scenarios spanning simple document retrieval to complex hybrid text-graph integration and agentic multi-step planning. A novel context engineering method is presented to address memory overflow issues in advanced RAG variants through new representations and agentic loop design. This optimization achieves a 19% to 53% reduction in token usage while efficiently managing retrievals. Further analysis reveals a retrieval-generation gap where expanded retrieval does not proportionally improve generation quality. The study suggests that current retrieval-oriented metrics may overstate the benefits of advanced retrieval techniques. These data-driven insights aim to guide the development of production-ready intelligent RAG systems.

arxiv arXiv cs.CL · 6h ago

BITEMBED: Extreme Low-Bit Framework for LLM-Based Text Embeddings

The paper introduces BITEMBED, an extreme low-bit framework designed to address the high deployment costs of LLM-based text embedders by targeting both encoding efficiency and vector storage. The method converts pretrained LLM backbones into BitNet-style encoders featuring ternary weights, quantized activations, and lightweight normalization refinement. To adapt these models for representation learning, BITEMBED employs continual contrastive pre-training followed by supervised contrastive fine-tuning. This fine-tuning process utilizes similarity-distribution distillation and attention-relation distillation from a full-precision teacher model. Beyond backbone quantization, the framework trains output embeddings to support multiple storage precisions, allowing for flexible trade-offs between performance and storage costs. Experiments on the MMTEB benchmark using Qwen3-0.6B and Gemma3-270M demonstrate that BITEMBED performs largely comparably to full-precision teacher embedders.
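As a rough illustration of ternary weights, the sketch below applies absmean quantization in the style of BitNet b1.58: divide by the mean absolute weight, round, and clip to {-1, 0, 1}. The summary does not state which quantizer BITEMBED uses, so this scheme, the `ternarize` helper, and the sample weights are assumptions.

```python
def ternarize(weights, eps=1e-8):
    """Absmean ternary quantization: returns (ternary values, scale)."""
    scale = sum(abs(w) for w in weights) / len(weights)
    quantized = [max(-1, min(1, round(w / (scale + eps)))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Approximate the original weights from ternary values and the scale."""
    return [q * scale for q in quantized]


# Hypothetical weight row.
q, scale = ternarize([0.9, -0.05, -1.2, 0.3])
approx = dequantize(q, scale)
```

Each weight then needs under two bits of storage plus one shared scale per tensor, which is where the encoding savings come from.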

arxiv arXiv cs.CL · 8h ago

Space-Efficient Language Generation in the Limit

This study initiates a resource-aware theory of language generation in the limit under space efficiency constraints. A learner observes an adversarial positive stream from a target language K and must output a hallucination-free hypothesis L while omitting at most Δ strings. The research focuses on DFAs with s states over an alphabet of size k as the hypothesis class for memory-bounded learners. In the exponential-space regime, the authors prove that a learner can exactly identify the target language K. Under stricter memory budgets, they present a streaming algorithm using poly(s, k) space that converges to a hypothesis with a generation gap of Δ = O(k^{2s-2}). This learned hypothesis captures every string in K of length at least 2s-1. The results are complemented by a near-matching lower bound derived from communication complexity, showing that achieving Δ ≤ k^{(1-ε)s} requires k^{Ω(εs)} memory. These findings reveal a sharp transition between polynomial-space generation and exponential-space exact identification.

arxiv arXiv cs.CL · 8h ago

SARA: Unlocking Multilingual Knowledge in Mixture-of-Experts via Semantically Anchored Routing Alignment

Sparse Mixture-of-Experts (MoE) architectures often struggle with low-resource languages due to cross-lingual routing divergence that limits expert sharing. To address this, researchers propose SARA, a framework that transfers specialized capabilities from high-resource anchor languages to low-resource ones. SARA aligns the internal routing distributions of MoE layers using a symmetric Jensen-Shannon divergence constraint rather than operating on output logits. This approach encourages mechanistic consistency in expert selection across different languages. The authors evaluated the method on two large language models across five low-resource languages and three benchmarks. Results show SARA outperforms standard instruction tuning, achieving gains of +0.8% on Qwen3-30B-A3B and +1.2% on Phi-3.5-MoE-instruct for Global-MMLU. These findings demonstrate that SARA effectively addresses performance bottlenecks in low-resource contexts.
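For reference, the Jensen-Shannon divergence between two routing distributions can be computed as in the sketch below. It only illustrates the divergence itself. How SARA pairs tokens across languages and weights the constraint during training is not specified here, and the example router probabilities are hypothetical.

```python
import math


def kl(p, q):
    """KL divergence KL(p || q) in nats; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Symmetric Jensen-Shannon divergence between two distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical router probabilities over four experts for the same
# sentence in an anchor language and in a low-resource language.
anchor_routing = [0.70, 0.20, 0.05, 0.05]
target_routing = [0.10, 0.15, 0.60, 0.15]
alignment_penalty = js_divergence(anchor_routing, target_routing)
```

Unlike plain KL divergence, this quantity is symmetric and bounded by log 2, so neither language's routing is treated as the fixed reference and the penalty stays finite even when the two routers pick disjoint experts.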

media r/LocalLLaMA · 11h ago

Colony: An Educational Simulation of LLM Attention Mechanisms Using Agent-Based Analogies

Colony is an educational resource designed to explain the attention mechanism of Large Language Models through simple analogies involving agents. The simulation places these agents within a board environment inspired by Conway's Game of Life. Each agent in the system represents a specific role within the self-attention block mechanism of an LLM. This visual approach allows users to observe how information flows and interacts during the attention process. The project is available as an open-source tool for those interested in exploring these concepts without complex mathematics. It serves as a fun and accessible way to understand the internal workings of transformer models.

arxiv arXiv cs.LG · 15h ago

Scalable Bayesian Models for Stellar Flare Detection

A generative surrogate framework using a Variational Autoencoder approximates Gaussian Process priors, bypassing costly covariance operations. The VAE+Hidden Markov Model architecture enables fast, scalable stellar flare detection in large astronomical time series, matching exact models in structural fidelity while reducing computational time significantly.

arxiv arXiv cs.AI · 16h ago

Geometry-Aware Online Scheduling for LLM Serving

A new scheduling algorithm, Smallest Volume First (SVF), reduces LLM inference latency by optimizing key-value cache management. Theoretical analysis shows a worst-case competitive ratio reduced from 48 to 5, with 1-bit SVF achieving strong performance using minimal information. Evaluations on Llama-3.1 models confirm improvements in both average and tail latency, with the approach integrated into vLLM.

arxiv arXiv cs.AI · 16h ago

Hypothesis-Driven Skill Optimization for LLM Agents

HDSO enables safe, auditable skill updates for LLM agents without training, using falsifiable hypotheses and validation. On ALFWorld, it improves Qwen3-8B by +6.9 Avg. SR points and maintains a +7.1-point gain under noisy feedback, with validated skills transferable across runs and models when diagnostic alignment is achieved.

arxiv arXiv cs.AI · 16h ago

Flow Annealing Posterior Sampling for Function-Space Regression and Inverse Problems

FAPS is the first function-space posterior sampling framework that unifies stochastic-process regression and PDE inverse problems. It uses pretrained flow-matching priors and Langevin correction with low-rank covariance preconditioning to enable efficient, accurate posterior inference from sparse, noisy data with coherent uncertainty quantification.

arxiv arXiv cs.AI · 16h ago

Select-to-Act: Hierarchical RL with Adaptive Language Guidance

HRLLI introduces a hierarchical reinforcement learning framework that adapts natural-language instructions dynamically during decision-making. It decomposes instructions into stage-specific guidance elements and uses a select-to-act paradigm to enable real-time selection of relevant instruction pieces, improving sample efficiency and performance in complex environments.

arxiv arXiv cs.AI · 16h ago

SAFER: Reliable Test-Time Adaptation under Adversarial Streams

SAFER is a training-free framework that enhances robustness of test-time adaptation by using reliability-guided augmentation. It generates stochastic augmentations, pools predictions via correlation-weighted aggregation with outlier detection, and includes adaptive mixing to preserve clean performance under adversarial attacks. Evaluations on PACS, VLCS, and OfficeHome show improved resilience without sacrificing clean accuracy.

arxiv arXiv cs.AI · 16h ago

Sparsity-Storage-Accuracy Tradeoff in Parsimoniously Activated Dictionary Learning

Parsimoniously activated dictionary learning (PADL) establishes a structured generative model with auxiliary latent variables, enabling maximum a posteriori estimation. This framework provides generalization guarantees and an analytical characterization of the tradeoff between sparsity, storage cost, and reconstruction accuracy, allowing data-driven hyperparameter estimation. The resulting algorithm achieves better reconstruction performance and accelerates inference in vision-language models.

arxiv arXiv cs.AI · 16h ago

First-Token Broadcasters in Transformers: Language Identity and Robustness

LIHA reveals a small set of first-token broadcaster heads in GPT-2 that persistently attend to the initial prompt token, driving language switches. Instruction tuning reorganizes these circuits, concentrating language identity at early layers, as seen in Qwen2.5-1.5B-Instruct and confirmed in Chinese and Russian language handling at layer 0.

arxiv arXiv cs.AI · 17h ago

ARIA: A Causal-Aware Framework for Rescuing LLM Reasoning

ARIA addresses contextual tunneling in LLMs by conditioning knowledge use on mechanistic completeness. It uses a three-tier cascade for causal reasoning, physics-informed transfer, and parametric fallback, and improves materials discovery through auditable, physically grounded reasoning.

arxiv arXiv cs.AI · 17h ago

HyperAdapter: Structured Hyperedge Adaptation for Vision Transformer Fine-Tuning

HyperAdapter introduces a hypergraph-based adapter that performs structured, group-aware adaptation in vision transformers by operating in hyperedge space rather than token space. It uses prototype-based assignments to build a soft hypergraph, aggregates token features into hyperedge representations, applies lightweight adaptation, and diffuses updates back via hypergraph structure, enabling explicit structural inductive bias while maintaining efficiency. Experiments show consistent performance gains over baseline PEFT methods, especially on tasks requiring structured reasoning.