Research paper
arXiv cs.CL · 12h ago

Probing Self-Supervised Speech Representations on Mandarin Sub-dialects via Unsupervised Articulatory Analysis

This study investigates how internal phonetic representations in self-supervised speech models behave under fine-grained dialect variation, addressing the limitations of existing probing studies that rely on curated corpora. The authors present a case study using an entirely unlabeled probing pipeline for Mandarin sub-dialects. Phone sequences are generated via a language-agnostic universal phone recognizer and mapped to articulatory feature vectors, enabling frame-level probing without manual annotation. Results reveal structured patterns in articulatory feature decodability across different Mandarin dialects. Acoustically salient features like labiality and stridency remain comparatively stable, while those associated with finer spectral distinctions show larger dialect-dependent variation. This variation is driven primarily by elevated decodability for Beijing speech relative to other sub-dialects. Layer-wise analyses demonstrate distinct representational dynamics for these feature groups, suggesting uneven dialect sensitivity across articulatory dimensions.

arXiv cs.CL · 12h ago

Fully Differentiable Neural Forced Alignment via Soft Dynamic Programming

The authors propose an end-to-end, fully differentiable neural architecture designed specifically for phoneme alignment to address the stagnation in this field compared to ASR advancements. The model features an encoder with two complementary branches dedicated to phoneme identity verification and boundary detection. A decoder implemented as a trainable module based on differentiable soft dynamic programming produces the final alignment decisions. The entire system is optimized using a novel contrastive loss that encourages clear separation between steady-state phoneme regions and transition boundaries. Experimental results show the approach outperforms current state-of-the-art methods on hand-annotated English benchmarks. Additionally, the model demonstrates strong word-level generalization capabilities and effective performance on unseen languages.
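
As a rough illustration of what "differentiable soft dynamic programming" can mean for alignment, the sketch below replaces the hard `min` of a monotonic alignment recursion with a smooth minimum. The cost matrix, the `gamma` temperature, and both function names are assumptions made for this example; the summary does not describe the paper's actual decoder.

```python
import math

def softmin(values, gamma):
    """Smooth minimum: -gamma * log(sum(exp(-v / gamma))), ignoring infinite entries."""
    finite = [v for v in values if v != math.inf]
    if not finite:
        return math.inf
    m = min(finite)  # subtract the minimum for numerical stability
    return m - gamma * math.log(sum(math.exp(-(v - m) / gamma) for v in finite))

def soft_alignment_cost(cost, gamma):
    """Soft cost of monotonically aligning T frames to P phonemes.

    cost[t][p] is a hypothetical mismatch score between frame t and phoneme p.
    Each frame either stays on the current phoneme or advances to the next one.
    """
    T, P = len(cost), len(cost[0])
    D = [[math.inf] * P for _ in range(T)]
    D[0][0] = cost[0][0]  # the alignment must start on the first phoneme
    for t in range(1, T):
        for p in range(P):
            stay = D[t - 1][p]
            advance = D[t - 1][p - 1] if p > 0 else math.inf
            D[t][p] = cost[t][p] + softmin([stay, advance], gamma)
    return D[T - 1][P - 1]  # the alignment must end on the last phoneme
```

As `gamma` approaches zero the result approaches the ordinary hard-minimum alignment cost, while larger values blend all paths smoothly, which is what makes gradients flow through every alignment decision.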

arXiv cs.CL · 12h ago

Fine-Tuned PEGASUS Achieves State-of-the-Art Performance on XL-Sum English Corpus

This paper presents a method for optimizing abstractive text summarization by fine-tuning the PEGASUS model on the XL-Sum English corpus. The objective is to surpass the performance of the baseline mT5 model in generating concise summaries that capture salient ideas without merely extracting sentences. The generated summaries are evaluated using the ROUGE metric, which compares auto-generated outputs against human-created references. The study claims that the fine-tuned PEGASUS model achieves state-of-the-art results on this specific dataset. Quantitative analysis reveals a 4.04% improvement in the ROUGE-1 score compared to the baseline. Additionally, the model demonstrates a significant 15.25% increase in the ROUGE-2 score. Finally, there is a reported 3.39% improvement in the ROUGE-L score, confirming the effectiveness of the fine-tuning approach.
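
For readers unfamiliar with the metric, the following is a minimal sketch of ROUGE-N as n-gram overlap between a generated summary and a reference. It assumes lowercased whitespace tokenization, with no stemming or stopword handling, so its scores will not match those of standard ROUGE toolkits. The summary above also does not say whether the reported percentage gains are relative or absolute, so this example makes no attempt to reproduce them.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n=1):
    """Return (precision, recall, f1) for clipped n-gram overlap."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # Counter intersection clips counts
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1
```

Calling `rouge_n("the cat sat", "the cat sat on the mat", n=1)` gives a precision of 1.0 and a recall of 0.5, since every candidate unigram appears in the reference but only half of the reference is covered.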

arXiv cs.CL · 13h ago

Cliff Tokens: Identifying Single-Token Failure Triggers in LLM Mathematical Reasoning

Researchers introduce the concept of cliff tokens to identify specific single-token failure triggers in large language models during mathematical reasoning tasks. Unlike prior work that analyzes failures at the step or sentence level, this method pinpoints the exact token at which potential drops significantly, using an adaptive threshold based on a z-test. The study evaluates seven models across three benchmarks: GSM1K, MATH500, and AIME 2025. Deleting the first cliff token and resampling allows pass@64 to recover to 1.0, whereas keeping it limits recovery to between 0.71 and 1.00. The authors propose a taxonomy classifying cliffs as deterministic, uncertain, or sampled-off based on greedy choice and token entropy. This classification generalizes across different model scales and exhibits distinct probabilistic characteristics for each type. Furthermore, the team validates this taxonomy through a single-token preference optimization method called Cliff-DPO. Trained on GSM8K, Cliff-DPO improves accuracy by up to +6.6 across benchmarks. Optimization proves effective for uncertain and sampled-off cliffs but yields no improvement for deterministic ones.
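
One way to picture a z-test threshold on token-level drops is sketched below. It assumes that "potential" at each position is a success rate estimated from `n` sampled continuations of the prefix, and it applies a standard two-proportion z statistic to consecutive positions. Both that interpretation and the names `drop_z_score`, `first_cliff`, and `z_crit` are assumptions for illustration; the paper's exact procedure is not given in the summary.

```python
import math

def drop_z_score(p_before, p_after, n):
    """Two-proportion z statistic for a drop between two rates, each estimated from n samples."""
    pooled = (p_before + p_after) / 2
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    if se == 0:
        return 0.0  # both rates are 0 or both are 1, so there is no drop to test
    return (p_before - p_after) / se

def first_cliff(potentials, n, z_crit=1.645):
    """Index of the first token whose arrival causes a significant drop, or None.

    potentials[i] is the assumed success rate of the prefix ending at token i.
    """
    for i in range(1, len(potentials)):
        if drop_z_score(potentials[i - 1], potentials[i], n) > z_crit:
            return i
    return None
```

Because the standard error shrinks as `n` grows, the same absolute drop becomes easier to flag with more samples, which is one sense in which such a threshold adapts to the data.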

arXiv cs.CL · 13h ago

SFL-MTSC: Leveraging Semantic Frame-Level Multi-Task Self-Consistency for Robust Multi-Intent Spoken Language Understanding

Prompt-based spoken language understanding with large language models often suffers from inconsistent intent-slot structures due to decoding stochasticity, particularly in multi-intent scenarios. To address this, researchers propose Semantic Frame-Level Multi-Task Self-Consistency (SFL-MTSC), a novel structured aggregation framework operating at the semantic frame level. Instead of relying on output-level majority voting, SFL-MTSC decomposes predictions into intent-specific frames and applies domain-intent grouping alongside slot-level clustering. The framework evaluates cluster reliability using path support scoring to determine which frames are trustworthy. Reliable frames are retained and re-integrated to form the final prediction, ensuring greater structural consistency. Zero-shot experiments on the MAC-SLU benchmark dataset demonstrate improved slot F1 scores and overall accuracy compared to single-path inference. Intent accuracy remains largely stable across most settings while achieving these gains in slot-level performance.
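
The difference between output-level voting and frame-level aggregation can be sketched as follows. This is a simplified illustration, not the SFL-MTSC algorithm: it assumes each sampled path yields `(domain, intent, slots)` frames, approximates "path support" as the fraction of paths containing a frame, and replaces slot-level clustering with exact-match voting. The `min_support` threshold and the function name are hypothetical.

```python
from collections import Counter, defaultdict

def aggregate_frames(paths, min_support=0.5):
    """Keep frames and slot values supported by at least min_support of the sampled paths."""
    groups = defaultdict(list)  # (domain, intent) -> one slot dict per supporting path
    for frames in paths:
        seen = set()
        for domain, intent, slots in frames:
            key = (domain, intent)
            if key not in seen:  # count each path at most once per frame
                seen.add(key)
                groups[key].append(slots)

    result = []
    for (domain, intent), slot_dicts in groups.items():
        if len(slot_dicts) / len(paths) < min_support:
            continue  # frame appears in too few paths to be trusted
        votes = defaultdict(Counter)
        for slots in slot_dicts:
            for name, value in slots.items():
                votes[name][value] += 1
        merged = {}
        for name, counter in votes.items():
            value, count = counter.most_common(1)[0]
            if count / len(paths) >= min_support:
                merged[name] = value
        result.append((domain, intent, merged))
    return sorted(result, key=lambda f: (f[0], f[1]))
```

With whole-output majority voting, three paths that each differ in one slot would produce no majority at all; voting per frame and per slot lets the agreeing parts survive.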

arXiv cs.CL · 13h ago

Riazi-8B: An Urdu Large Language Model for Mathematical Reasoning

Recent large language models demonstrate strong mathematical reasoning, but these gains rely heavily on English-centric resources, leaving low-resource languages like Urdu with limited capabilities. To address this gap, researchers developed Riazi-8B, an Urdu model designed specifically for multi-step mathematical problem solving. The model was created through a two-step adaptation process involving continued pre-training on Urdu Wikipedia and supervised fine-tuning on Urdu Chain-of-Thought data derived from GSM8K. Evaluation of Riazi-8B was conducted on the MGSM-Urdu benchmark against existing Urdu instruction-tuned models. The results showed consistent improvements in answer correctness, reasoning quality, response completeness, and Urdu generation compared to baselines. These findings demonstrate that combining Urdu language adaptation with reasoning-focused fine-tuning effectively extends mathematical reasoning capabilities to low-resource languages.

arXiv cs.CL · 14h ago

Constraint Tax in Open-Weight LLMs: Tool Calling Suppression Under Structured Output Constraints

This study identifies a phenomenon called Tool Suppression, where open-weight language models cease invoking tools when JSON Schema constraints are simultaneously enabled. The authors observed this behavior in a production Agent system and reproduced it through controlled experiments across multiple model families. While tool execution and schema compliance function correctly when evaluated independently, they fail under joint deployment conditions. Analysis reveals that JSON Schema constraints are compiled into grammar-based token masks, rendering tool-call tokens unreachable during decoding. To interpret these findings, the paper proposes the Constraint Priority Inversion hypothesis, suggesting schema satisfaction dominates action selection under simultaneous constraints. The authors mitigate this issue by introducing Transparent Two-Pass Execution, an inference-time strategy that decouples tool execution from response generation. This approach restores tool invocation while preserving structured output guarantees without requiring model retraining. The research highlights that evaluating capabilities separately may overlook critical reliability issues in production systems.
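
The masking mechanism described above can be shown with a toy decoder step. In this sketch the three-token vocabulary, the logit values, the `<tool_call>` token name, and the `two_pass` helper are all invented for illustration; the summary does not detail how Transparent Two-Pass Execution is implemented, so the second function only conveys the general idea of separating the tool decision from the constrained response.

```python
import math

def constrained_argmax(logits, allowed):
    """Pick the highest-scoring token after masking everything outside `allowed`."""
    masked = {tok: (score if tok in allowed else -math.inf)
              for tok, score in logits.items()}
    return max(masked, key=masked.get)

def two_pass(logits_pass1, logits_pass2, schema_allowed):
    """Hypothetical decoupling: decide on tools without a mask, then answer under the schema."""
    first = constrained_argmax(logits_pass1, set(logits_pass1))  # pass 1: no grammar mask
    tool_used = first == "<tool_call>"
    final_first_token = constrained_argmax(logits_pass2, schema_allowed)  # pass 2: masked
    return tool_used, final_first_token

# Toy first-token logits: the model prefers a tool call.
logits = {"<tool_call>": 3.2, "{": 1.1, "Hello": 0.4}
# A schema requiring a JSON object admits only "{" as the first token.
schema_allowed = {"{"}
```

Under the mask, `constrained_argmax(logits, schema_allowed)` returns `"{"` even though the tool-call token has the highest score, which mirrors the suppression effect: the model's preference is intact, but the token is unreachable.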

arXiv cs.CL · 14h ago

REVERIEMEM: Perspective-Bounded Memory for Book-Based Role-Playing Agents

Recent large language model role-playing systems often fail in long-narrative contexts due to factual overreach and stylistic monotony. Factual overreach occurs when characters access information outside their narrative perspective, while stylistic monotony flattens character voices through static profile descriptions. To address these issues, the authors propose REVERIEMEM, a three-layer memory architecture designed for book-based character agents. This system utilizes an episodic layer for first-person scene memories, a semantic layer for visibility-tagged facts, and a personality layer for situation-dependent behavioral patterns. The researchers also introduce KBF-QA, a benchmark consisting of 4,386 questions across eight novels to test knowledge boundaries. Experimental results show that REVERIEMEM improves Knowledge Boundary Fidelity by 34.6 percentage points compared to prior methods. Additionally, the model achieves approximately a 79% win rate on BOOKWORLD's five-dimension pairwise narrative protocol. These findings suggest that perspective-bounded memory effectively enhances both factual accuracy and character-grounded narrative generation.
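
To make the idea of visibility-tagged facts concrete, here is a small data-structure sketch. The `Fact` fields, the chapter-based cutoff, and the `recall` function are assumptions chosen for this example and are not taken from the REVERIEMEM design.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Fact:
    text: str
    chapter: int            # chapter in which the fact is established (assumed field)
    visible_to: frozenset   # characters who witnessed or were told the fact (assumed field)

def recall(facts, character, current_chapter):
    """Return only the facts this character could know at this point in the story."""
    return [
        f.text for f in facts
        if character in f.visible_to and f.chapter <= current_chapter
    ]

# Invented example facts for two invented characters.
facts = [
    Fact("The letter was burned", 3, frozenset({"Anna"})),
    Fact("The estate was sold", 5, frozenset({"Anna", "Boris"})),
]
```

Filtering at retrieval time means a character agent asked about the burned letter simply has nothing to retrieve, rather than relying on the language model to withhold information it has been shown.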

arXiv cs.CL · 14h ago

Framework Evaluates When GraphRAG and Agentic RAG Are Needed

The authors introduce a framework for evaluating and comparing regular, GraphRAG, Modular, and Agentic Retrieval-Augmented Generation (RAG) on semi-structured knowledge bases. They implement nine standardized scenarios spanning simple document retrieval to complex hybrid text-graph integration and agentic multi-step planning. A novel context engineering method is presented to address memory overflow issues in advanced RAG variants through new representations and agentic loop design. This optimization achieves a 19% to 53% reduction in token usage while efficiently managing retrievals. Further analysis reveals a retrieval-generation gap where expanded retrieval does not proportionally improve generation quality. The study suggests that current retrieval-oriented metrics may overstate the benefits of advanced retrieval techniques. These data-driven insights aim to guide the development of production-ready intelligent RAG systems.

arXiv cs.CL · 14h ago

BITEMBED: Extreme Low-Bit Framework for LLM-Based Text Embeddings

The paper introduces BITEMBED, an extreme low-bit framework designed to address the high deployment costs of LLM-based text embedders by targeting both encoding efficiency and vector storage. The method converts pretrained LLM backbones into BitNet-style encoders featuring ternary weights, quantized activations, and lightweight normalization refinement. To adapt these models for representation learning, BITEMBED employs continual contrastive pre-training followed by supervised contrastive fine-tuning. This fine-tuning process utilizes similarity-distribution distillation and attention-relation distillation from a full-precision teacher model. Beyond backbone quantization, the framework trains output embeddings to support multiple storage precisions, allowing for flexible trade-offs between performance and storage costs. Experiments on the MMTEB benchmark using Qwen3-0.6B and Gemma3-270M demonstrate that BITEMBED performs largely comparably to full-precision teacher embedders.
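
For context on "ternary weights", the sketch below shows the absmean quantizer used in BitNet b1.58, which scales weights by their mean absolute value and rounds each to -1, 0, or +1. Whether BITEMBED uses exactly this rule is an assumption; the summary only says the encoders are BitNet-style. The function name and the flat weight list are illustrative.

```python
def ternarize(weights, eps=1e-8):
    """Quantize a flat list of weights to {-1, 0, +1} with a single shared scale.

    Returns (quantized, scale); an approximate reconstruction is q * scale.
    """
    scale = sum(abs(w) for w in weights) / len(weights) + eps  # mean absolute value
    # Note: Python's round() rounds exact halves to the nearest even integer.
    quantized = [max(-1, min(1, round(w / scale))) for w in weights]
    return quantized, scale
```

Each ternary weight carries at most log2(3), roughly 1.58, bits of information, compared with 16 bits for a half-precision weight, which is where the deployment savings come from.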

arXiv cs.CL · 15h ago

Space-Efficient Language Generation in the Limit

This study initiates a resource-aware theory of language generation in the limit under space efficiency constraints. A learner observes an adversarial positive stream from a target language K and must output a hallucination-free hypothesis L while omitting at most Δ strings. The research focuses on DFAs with s states over an alphabet of size k as the hypothesis class for memory-bounded learners. In the exponential-space regime, the authors prove that a learner can exactly identify the target language K. Under stricter memory budgets, they present a streaming algorithm using poly(s, k) space that converges to a hypothesis with a generation gap of Δ = O(k^{2s-2}). This learned hypothesis captures every string in K of length at least 2s-1. The results are complemented by a near-matching lower bound derived from communication complexity, showing that achieving Δ ≤ k^{(1-ε)s} requires k^{Ω(εs)} memory. These findings reveal a sharp transition between polynomial-space generation and exponential-space exact identification.

arXiv cs.CL · 16h ago

SARA: Unlocking Multilingual Knowledge in Mixture-of-Experts via Semantically Anchored Routing Alignment

Sparse Mixture-of-Experts (MoE) architectures often struggle with low-resource languages due to cross-lingual routing divergence that limits expert sharing. To address this, researchers propose SARA, a framework that transfers specialized capabilities from high-resource anchor languages to low-resource ones. SARA aligns the internal routing distributions of MoE layers using a symmetric Jensen-Shannon divergence constraint rather than operating on output logits. This approach encourages mechanistic consistency in expert selection across different languages. The authors evaluated the method on two large language models across five low-resource languages and three benchmarks. Results show SARA outperforms standard instruction tuning, achieving gains of +0.8% on Qwen3-30B-A3B and +1.2% on Phi-3.5-MoE-instruct for Global-MMLU. These findings demonstrate that SARA effectively addresses performance bottlenecks in low-resource contexts.
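
The Jensen-Shannon divergence mentioned above has a standard definition, sketched here for two routing distributions over the same set of experts. How SARA pairs tokens across languages and weights this term in the training loss is not stated in the summary, so the example covers only the divergence itself.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats, skipping zero-probability terms of p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Symmetric Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture of the two distributions
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Unlike plain KL divergence, this quantity is symmetric and bounded above by log 2, so neither the anchor language nor the low-resource language is privileged and disjoint routing patterns cannot produce an infinite penalty.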

arXiv cs.LG · 16h ago

Select-to-Act: Hierarchical Reinforcement Learning via Adaptive Language Guidance

The paper introduces HRLLI, a hierarchical reinforcement learning framework designed to improve sample efficiency by leveraging natural-language instructions. It addresses the limitation of existing approaches that treat instructions as static inputs, failing to account for their stage-dependent relevance in complex environments. The proposed method decomposes instructions into piecewise guidance elements that become relevant at different interaction stages. A novel Select-to-Act paradigm is formulated where a high-level semantic policy acts as a selector for the most relevant instruction piece based on the current state. This selected guidance conditions a low-level policy that executes environment actions, with both policies learned simultaneously to maximize augmented expected returns. Experiments on the RTFM benchmark demonstrate that HRLLI consistently outperforms strong instruction-conditioned RL baselines. The results confirm that explicitly modeling adaptive instruction selection significantly enhances reinforcement learning effectiveness.

arXiv cs.LG · 16h ago

SAFER: Reliability-Guided Adaptive Ensembling for Robust Test-Time Adaptation

The authors address the brittleness of test-time adaptation (TTA) under adversarially contaminated streams by proposing SAFER, a training-free framework for robust TTA. SAFER acts as an augmentation wrapper that replaces single-view predictions with a reliability-guided pooled predictor to stabilize online updates. For each test sample, the method generates stochastic augmentations and aggregates their outputs using correlation-weighted pooling combined with outlier detection. An adaptive-mixing extension is also introduced, which adjusts the weighting between original and augmented inputs based on feature disagreement signals to preserve clean performance. The researchers evaluated SAFER on PACS, VLCS, and OfficeHome benchmarks under PGD attacks at various rates. Results indicate that SAFER improves the resilience of TTA methods against adversarial attacks while maintaining competitive accuracy on clean data.
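
A rough sketch of correlation-weighted pooling with outlier rejection follows. It assumes each augmented view yields a class-probability vector, weights each view by its Pearson correlation with the plain average, and drops views at or below a `min_corr` cutoff. These specific choices, along with the function names, are assumptions for illustration and may differ from SAFER's actual pooling rule.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length vectors; 0.0 if either is constant."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / math.sqrt(va * vb)

def pooled_prediction(views, min_corr=0.0):
    """Weighted average of per-view predictions, ignoring views that disagree with the consensus."""
    n_classes = len(views[0])
    consensus = [sum(v[c] for v in views) / len(views) for c in range(n_classes)]
    weights = [pearson(v, consensus) for v in views]
    weights = [w if w > min_corr else 0.0 for w in weights]  # treat low-correlation views as outliers
    total = sum(weights)
    if total == 0:
        return consensus  # fall back to the unweighted mean
    return [sum(w * v[c] for w, v in zip(weights, views)) / total
            for c in range(n_classes)]
```

If one augmented view is pushed toward a wrong class by a perturbation while the others agree, its correlation with the consensus turns negative and it contributes nothing to the pooled output.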

arXiv cs.LG · 16h ago

Parsimoniously Activated Dictionary Learning Links Sparsity and Storage to Generative Models

The paper introduces parsimoniously activated dictionary learning (PADL), a method imposing global regularization on the number of activated dictionary atoms. It demonstrates that PADL is equivalent to maximum a posteriori estimation under a structured generative model with auxiliary latent variables. This equivalence enables the derivation of generalization guarantees that are difficult to obtain from the original formulation. The authors provide an analytical characterization of the tradeoff between sparsity, storage cost, and reconstruction accuracy. This framework allows for data-driven estimation of optimal hyperparameters without manual tuning. An efficient and interpretable PADL algorithm is developed based on this theoretical connection. Experimental results show improved reconstruction performance under comparable sparsity levels on visual benchmarks. The method also demonstrates practical utility in accelerating inference for vision-language models.

arXiv cs.LG · 16h ago

Multigrid Training for Molecular Generation using Graph Neural Networks

The authors introduce a multigrid training strategy to address the high computational costs and instability associated with modeling biochemical molecular systems at full resolution. This approach leverages low-resolution optimization to accelerate learning at higher resolutions by transferring parameters across different discretizations. For graph-based molecular representations, the method progressively transfers parameters from a coarse graph to increasingly finer graphs using biased random walk upsampling. In 3D molecular generation, structures are voxelized at multiple resolutions, allowing a coarse-resolution conditional Variational Autoencoder (CVAE) to be pretrained first. Shape-compatible convolutional parameters are then transferred from the coarse model to initialize a fine-resolution CVAE. Numerical experiments on receptor-conditioned 3D ligand generation demonstrate that this method accelerates convergence compared to training from scratch. Additionally, the study shows that multigrid training improves generalization capabilities for molecular generation tasks.

arXiv cs.LG · 17h ago

HyperAdapter: Structured Hyperedge Adaptation for Parameter-Efficient Fine-Tuning of Vision Transformers

The authors propose HyperAdapter, a novel parameter-efficient fine-tuning method that adapts vision transformers in hyperedge space rather than token space. Existing adapter-based methods typically perform independent adaptations for each token, which overlooks structured relationships and can lead to redundant updates. HyperAdapter constructs a soft hypergraph over ViT tokens using prototype-based assignments to enable group-aware adaptation. The architecture aggregates token features into latent hyperedge representations and applies lightweight bottleneck adaptation at the hyperedge level. Updates are then diffused back to individual tokens via the hypergraph incidence structure, injecting an explicit structural inductive bias. Extensive experiments across diverse visual benchmarks demonstrate that this approach consistently outperforms strong PEFT baselines under comparable parameter budgets. The results highlight significant gains on tasks requiring structured reasoning and suggest that the choice of adaptation space is a critical dimension for efficient transfer.

arXiv cs.LG · 17h ago

Shift-Invariant Variance Estimator Eliminates Minimization Bias in Local Learning Coefficient Estimation

Singular Learning Theory uses the Local Learning Coefficient to quantify neural network loss landscape geometry, but mean-energy estimators rely on an additive loss baseline. During off-equilibrium training phases, this minimum is unknown, and substituting it with noisy mini-batch losses introduces systematic minimization bias. The authors propose the Shift-Invariant Variance Estimator (SIVE) to structurally eliminate this unknown baseline through the variance operator. By combining SIVE with a correction derived from the Law of Total Variance, the method separates geometric loss fluctuations from evaluation noise. Controlled experiments on analytically tractable toy models demonstrate that SIVE recovers expected finite-temperature geometric signals where anchored mean estimators fail. Applied to deep neural networks, SIVE serves as a robust diagnostic for tracking structural phase transitions throughout training.
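
The core observation, that a variance is unaffected by an unknown additive baseline while a mean is not, can be checked numerically. The sketch below is not the SIVE estimator itself: the loss values are made up, the function names are hypothetical, and the noise correction shows only the generic law-of-total-variance decomposition rather than the paper's specific formula.

```python
import statistics

def mean_energy(losses, baseline):
    """Anchored quantity: mean sampled loss minus an assumed loss baseline."""
    return statistics.fmean(losses) - baseline

def shift_free_fluctuation(losses):
    """Variance of sampled losses; adding a constant to every loss leaves it unchanged."""
    return statistics.pvariance(losses)

def geometric_variance(total_variance, mean_noise_variance):
    """Law of total variance: Var(Y) = Var(E[Y | w]) + E[Var(Y | w)].

    Subtracting the average evaluation-noise variance E[Var(Y | w)] leaves the
    part of the fluctuation attributable to the sampled weights w.
    """
    return max(total_variance - mean_noise_variance, 0.0)

losses = [1.0, 1.2, 0.9, 1.1]          # made-up sampled losses
shifted = [x + 5.0 for x in losses]    # same fluctuations, different unknown offset
```

Here `shift_free_fluctuation(losses)` and `shift_free_fluctuation(shifted)` agree, whereas `mean_energy` with a fixed baseline differs by the full offset of 5.0, which is the kind of error a misestimated minimum introduces.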

arXiv cs.LG · 17h ago

Efficient CNN with Transfer Learning for Multi-Cancer Detection

This study introduces a lightweight convolutional neural network enhanced with transfer learning for multi-cancer detection using biomedical images. The architecture aims to reduce computational complexity while maintaining high classification performance for deployment in resource-constrained environments. Researchers evaluated the model on three tumor datasets comprising brain MRI scans and lung and kidney CT scans. The system achieved test accuracies of 90.85%, 98.64%, and 99.92% for brain, lung, and kidney cancer, respectively, via five-fold stratified cross-validation. Transfer learning was employed by pretraining on one cancer type and fine-tuning on others, requiring only 20 additional epochs to match scratch-trained models. The fine-tuning process updates the classification part of the CNN and takes approximately 0.014 seconds per image per epoch on an NVIDIA GeForce GTX 960. Comparative evaluations demonstrate that this model outperforms state-of-the-art architectures such as Xception, VGG16, VGG19, MobileNetV2, and DenseNet121.

arXiv cs.LG · 17h ago

P4IR: Reinforcement Learning Enhances Automated Code Compliance Systems

A new framework named P4IR addresses the issue of hallucinated rules in large language model-based automated code compliance systems. This two-stage approach first employs supervised fine-tuning to instill domain knowledge into the model. It then utilizes Group Relative Policy Optimization to improve the accuracy of generated high-level code skeletons. The method achieved reductions of up to 23.8% in tree edit distance and 38.6% in token-level Levenshtein distance compared to supervised fine-tuning baselines. Comparative analysis shows that P4IR outperforms leading models like Claude Opus, GPT-5.2, and Qwen-3-Max in zero-shot settings. Additionally, the reinforcement learning stage produced a statistically significant reduction in false positives. This combination of techniques offers a path toward more reliable automated code compliance.
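
Token-level Levenshtein distance, one of the two metrics cited above, counts the minimum number of token insertions, deletions, and substitutions needed to turn one sequence into another. The sketch below is the textbook dynamic program; how P4IR tokenizes code skeletons or normalizes the distance is not specified in the summary, so the example takes pre-tokenized lists.

```python
def token_levenshtein(a, b):
    """Edit distance between two token sequences, using unit costs."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            substitute = prev[j - 1] + (tok_a != tok_b)
            delete = prev[j] + 1
            insert = curr[j - 1] + 1
            curr.append(min(substitute, delete, insert))
        prev = curr
    return prev[-1]
```

For example, `token_levenshtein(["if", "x", ">", "0"], ["if", "x", ">=", "0", ":"])` is 2: one substitution and one insertion.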