Research paper
arxiv arXiv cs.CL · 12h ago

Introducing corpora Hlava Cor and Hlava AD: Human Label Variation in Coreference and Discourse Relations

Researchers have created two new corpora, Hlava Cor and Hlava AD, to explore human variation in understanding text coherence. These resources contain multiple annotations of Czech texts along with annotators' explanations for their choices. The first corpus, Hlava Cor, consists of 1,024 contexts annotated by three individuals to capture coreference identification differences. It covers pronouns, full noun phrases, and anaphoric adverbials across various text types and grammatical-semantic categories. The second corpus, Hlava AD, comprises 512 contexts annotated by five annotators focusing on discourse relations in attributive and non-attributive constructions. Both corpora achieve an inter-annotator agreement of approximately 60-65 percent. Analysis reveals that lower coreference agreement correlates with automatic model disagreement, indicating higher ambiguity. Annotator comments further highlight varying confidence levels and individual reading strategies.

arxiv arXiv cs.CL · 12h ago

Agent-Authored World Modeling Aligns Training with Decision Needs

The paper introduces Agent-Authored World Modeling (AAWM), a training procedure that addresses the limitations of standard world modeling objectives tied to next-observation prediction. This traditional approach often omits dynamics relevant to an agent's current decision because supervision depends on what a transition reveals rather than what is needed. AAWM constructs supervision directly from the policy's decision needs by having the agent identify necessary environmental understanding at each state. Relevant transition evidence is retrieved across trajectories and synthesized into training targets that capture these decision-oriented dynamics. This method aligns the learning objective with the specific information required before acting, rather than forcing the model to reconstruct the next observation. Experimental results validate AAWM's effectiveness across multiple environments and training settings. The findings demonstrate that decision-aware world-model targets provide a more effective learning signal than conventional next-observation prediction.

arxiv arXiv cs.CL · 12h ago

OscillaTTS: Adaptive Oscillatory Inductive Bias for Modeling Sharp Prosodic Dynamics in Diffusion-Based TTS

Diffusion-based text-to-speech models have improved speech quality but struggle with sharp prosodic transitions and rapid pitch variations. Existing decoders often use periodic nonlinearities like the Snake activation function, which lack adaptability for abrupt amplitude and frequency changes. To address this, the authors introduce OscillaTTS, a system featuring an adaptive oscillatory nonlinearity. This component enables controllable periodic modulation while ensuring signal stability via a linear bypass mechanism. The study investigates the role of oscillatory inductive bias within diffusion-based TTS decoders. Experiments conducted on the LJSpeech and Emotional Speech Dataset demonstrate consistent improvements in both objective and subjective evaluations. These results indicate that OscillaTTS effectively models expressive prosodic dynamics compared to prior methods.

arxiv arXiv cs.CL · 12h ago

PolicyAlign: Direct Policy-Based Safety Alignment for Large Language Models

The authors introduce PolicyAlign, a framework designed to align large language models directly with natural-language safety policies rather than relying on costly supervision data. This approach addresses the mismatch between rapidly evolving safety requirements and conventional data-driven alignment methods. The process begins by synthesizing instructions that violate the specified policy, followed by on-policy self-distillation to internalize the desired behavior. To enhance training stability and data efficiency, the method incorporates Policy-Sensitive Filtering, which selects instructions inducing the largest behavioral shift. Experiments across multiple models demonstrate that PolicyAlign consistently improves safety metrics while maintaining low over-refusal rates and preserving general capabilities. The framework also generalizes effectively to specialized domains such as medical, legal, and financial safety scenarios. The code for this scalable alignment approach is released at https://github.com/Qwen-Applications/PolicyAlign.

arxiv arXiv cs.CL · 12h ago

Translation-Enhanced Speech Encoder Pre-training Improves Speech LLMs

Connecting a pre-trained speech encoder to a Large Language Model creates a structural misalignment because encoders often produce language-specific representations while LLMs operate in a unified, language-agnostic space. The authors argue that incorporating speech translation objectives into the pre-training process provides a principled mechanism to bridge this gap. Unlike monolingual transcription, translation forces the model to learn representations that are independent of specific languages. The study experimentally evaluates the impact of adding these translation objectives during speech encoder pre-training. Results demonstrate that this approach significantly improves cross-modal integration between the speech and text modalities. Consequently, models utilizing translation-enhanced pre-training achieve superior performance across various downstream Speech LLM tasks.

arxiv arXiv cs.CL · 12h ago

Reclaim Evaluation Shows Lossy Memory Is Worse Than No Memory

A study demonstrates that a language model's memory containing incorrect conclusions is more detrimental than having no memory at all. When models retain stale values while dropping supporting work, they emit confident but wrong answers, whereas empty memories allow for abstention. This phenomenon, termed brittle memory, was observed across seven models where the direction of failure never reversed regardless of task or disposition. The researchers introduced reclaim evaluation to measure correctability by compressing interactions and testing if corrections recover ground truth without using a judge. Results indicate that correctability depends on whether the source information survives compression rather than model capability. A source-first policy, which keeps recomputable sources and drops re-derivable conclusions, restored correctability significantly better than length-matched controls. In chained memory loops, dropped-source errors corrupt downstream steps irreparably, while the proposed fix maintains bounded performance horizons. The findings replicate across three deployed systems and real dialogue data, with a hand-built oracle reaching perfect accuracy.

arxiv arXiv cs.CL · 13h ago

The Generalization Spectrum: A Chromatographic Approach to Evaluating Learning Algorithms

Traditional evaluations reduce learning to a single aggregate score, obscuring how well knowledge from one example generalizes to others. The authors introduce the Generalization Spectrum, an evaluation framework that measures per-sample generalization by tracking performance across test variants with increasing transfer distance. These variants range from exact recall to implementation transfer across languages and context transfer under narrative reframing. The framework is instantiated on competitive programming using a selection-and-synthesis pipeline seeded with recent problems to mitigate contamination. Comparisons of canonical learning paradigms show that Reinforcement Learning converts memorization into near-transfer more efficiently than Supervised Fine-Tuning baselines. In-context learning exhibits strong but correspondence-dependent transfer capabilities in this context. Diagnostic profiles reveal that local gains do not necessarily expand the generalization radius for all methods. Specifically, abstractions and hints mainly lift local transfer, while Reference SFT preserves a stronger far-transfer tail than RFT. Furthermore, self-distillation or hint-assisted RL can reduce far transfer even when local transfer improves.

arxiv arXiv cs.CL · 13h ago

Probing Self-Supervised Speech Representations on Mandarin Sub-dialects via Unsupervised Articulatory Analysis

This study investigates how internal phonetic representations in self-supervised speech models behave under fine-grained dialect variation, addressing the limitations of existing probing studies that rely on curated corpora. The authors present a case study using an entirely unlabeled probing pipeline for Mandarin sub-dialects. Phone sequences are generated via a language-agnostic universal phone recognizer and mapped to articulatory feature vectors, enabling frame-level probing without manual annotation. Results reveal structured patterns in articulatory feature decodability across different Mandarin dialects. Acoustically salient features like labiality and stridency remain comparatively stable, while those associated with finer spectral distinctions show larger dialect-dependent variation. This variation is driven primarily by elevated decodability for Beijing speech relative to other sub-dialects. Layer-wise analyses demonstrate distinct representational dynamics for these feature groups, suggesting uneven dialect sensitivity across articulatory dimensions.

arxiv arXiv cs.CL · 13h ago

Fully Differentiable Neural Forced Alignment via Soft Dynamic Programming

The authors propose an end-to-end, fully differentiable neural architecture designed specifically for phoneme alignment to address the stagnation in this field compared to ASR advancements. The model features an encoder with two complementary branches dedicated to phoneme identity verification and boundary detection. A decoder implemented as a trainable module based on differentiable soft dynamic programming produces the final alignment decisions. The entire system is optimized using a novel contrastive loss that encourages clear separation between steady-state phoneme regions and transition boundaries. Experimental results show the approach outperforms current state-of-the-art methods on hand-annotated English benchmarks. Additionally, the model demonstrates strong word-level generalization capabilities and effective performance on unseen languages.

arxiv arXiv cs.CL · 13h ago

Fine-Tuned PEGASUS Achieves State-of-the-Art Performance on XL-Sum English Corpus

This paper presents a method for optimizing abstractive text summarization by fine-tuning the PEGASUS model on the XL-Sum English corpus. The objective is to surpass the performance of the baseline mT5 model in generating concise summaries that capture salient ideas without merely extracting sentences. The generated summaries are evaluated using the ROUGE metric, which compares auto-generated outputs against human-created references. The study claims that the fine-tuned PEGASUS model achieves state-of-the-art results on this specific dataset. Quantitative analysis reveals a 4.04% improvement in the ROUGE-1 score compared to the baseline. Additionally, the model demonstrates a significant 15.25% increase in the ROUGE-2 score. Finally, there is a reported 3.39% improvement in the ROUGE-L score, confirming the effectiveness of the fine-tuning approach.

arxiv arXiv cs.CL · 13h ago

Cliff Tokens: Identifying Single-Token Failure Triggers in LLM Mathematical Reasoning

Researchers introduce the concept of cliff tokens to identify specific single-token failure triggers in large language models during mathematical reasoning tasks. Unlike prior work that analyzes failures at step or sentence levels, this method pinpoints the exact token where potential drops significantly using an adaptive threshold based on a z-test. The study evaluates seven models across three benchmarks: GSM1K, MATH500, and AIME 2025. Deleting the first cliff token and resampling allows recovery of pass@64 to 1.0, whereas keeping it limits recovery between 0.71 and 1.00. The authors propose a taxonomy classifying cliffs as deterministic, uncertain, or sampled-off based on greedy choice and token entropy. This classification generalizes across different model scales and exhibits distinct probabilistic characteristics for each type. Furthermore, the team validates this taxonomy through single-token preference optimization known as Cliff-DPO. Trained on GSM8K, Cliff-DPO improves accuracy by up to +6.6 across benchmarks. Optimization proves effective for uncertain and sampled-off cliffs but yields no improvement for deterministic ones.

arxiv arXiv cs.CL · 14h ago

SFL-MTSC: Leveraging Semantic Frame-Level Multi-Task Self-Consistency for Robust Multi-Intent Spoken Language Understanding

Prompt-based spoken language understanding with large language models often suffers from inconsistent intent-slot structures due to decoding stochasticity, particularly in multi-intent scenarios. To address this, researchers propose Semantic Frame-Level Multi-Task Self-Consistency (SFL-MTSC), a novel structured aggregation framework operating at the semantic frame level. Instead of relying on output-level majority voting, SFL-MTSC decomposes predictions into intent-specific frames and applies domain-intent grouping alongside slot-level clustering. The framework evaluates cluster reliability using path support scoring to determine which frames are trustworthy. Reliable frames are retained and re-integrated to form the final prediction, ensuring greater structural consistency. Zero-shot experiments on the MAC-SLU benchmark dataset demonstrate improved slot F1 scores and overall accuracy compared to single-path inference. Intent accuracy remains largely stable across most settings while achieving these gains in slot-level performance.

arxiv arXiv cs.CL · 14h ago

Riazi-8B: An Urdu Large Language Model for Mathematical Reasoning

Recent large language models demonstrate strong mathematical reasoning, but these gains rely heavily on English-centric resources, leaving low-resource languages like Urdu with limited capabilities. To address this gap, researchers developed Riazi-8B, an Urdu model designed specifically for multi-step mathematical problem solving. The model was created through a two-step adaptation process involving continued pre-training on Urdu Wikipedia and supervised fine-tuning on Urdu Chain-of-Thought data derived from GSM8K. Evaluation of Riazi-8B was conducted on the MGSM-Urdu benchmark against existing Urdu instruction-tuned models. The results showed consistent improvements in answer correctness, reasoning quality, response completeness, and Urdu generation compared to baselines. These findings demonstrate that combining Urdu language adaptation with reasoning-focused fine-tuning effectively extends mathematical reasoning capabilities to low-resource languages.

arxiv arXiv cs.CL · 14h ago

Constraint Tax in Open-Weight LLMs: Tool Calling Suppression Under Structured Output Constraints

This study identifies a phenomenon called Tool Suppression, where open-weight language models cease invoking tools when JSON Schema constraints are simultaneously enabled. The authors observed this behavior in a production Agent system and reproduced it through controlled experiments across multiple model families. While tool execution and schema compliance function correctly when evaluated independently, they fail under joint deployment conditions. Analysis reveals that JSON Schema constraints are compiled into grammar-based token masks, rendering tool-call tokens unreachable during decoding. To interpret these findings, the paper proposes the Constraint Priority Inversion hypothesis, suggesting schema satisfaction dominates action selection under simultaneous constraints. The authors mitigate this issue by introducing Transparent Two-Pass Execution, an inference-time strategy that decouples tool execution from response generation. This approach restores tool invocation while preserving structured output guarantees without requiring model retraining. The research highlights that evaluating capabilities separately may overlook critical reliability issues in production systems.

arxiv arXiv cs.CL · 14h ago

REVERIEMEM: Perspective-Bounded Memory for Book-Based Role-Playing Agents

Recent large language model role-playing systems often fail in long-narrative contexts due to factual overreach and stylistic monotony. Factual overreach occurs when characters access information outside their narrative perspective, while stylistic monotony flattens character voices through static profile descriptions. To address these issues, the authors propose REVERIEMEM, a three-layer memory architecture designed for book-based character agents. This system utilizes an episodic layer for first-person scene memories, a semantic layer for visibility-tagged facts, and a personality layer for situation-dependent behavioral patterns. The researchers also introduce KBF-QA, a benchmark consisting of 4,386 questions across eight novels to test knowledge boundaries. Experimental results show that REVERIEMEM improves Knowledge Boundary Fidelity by 34.6 percentage points compared to prior methods. Additionally, the model achieves approximately a 79% win rate on BOOKWORLD's five-dimension pairwise narrative protocol. These findings suggest that perspective-bounded memory effectively enhances both factual accuracy and character-grounded narrative generation.

arxiv arXiv cs.CL · 15h ago

Framework Evaluates When GraphRAG and Agentic RAG Are Needed

The authors introduce a framework for evaluating and comparing regular, GraphRAG, Modular, and Agentic Retrieval-Augmented Generation (RAG) on semi-structured knowledge bases. They implement nine standardized scenarios spanning simple document retrieval to complex hybrid text-graph integration and agentic multi-step planning. A novel context engineering method is presented to address memory overflow issues in advanced RAG variants through new representations and agentic loop design. This optimization achieves a 19% to 53% reduction in token usage while efficiently managing retrievals. Further analysis reveals a retrieval-generation gap where expanded retrieval does not proportionally improve generation quality. The study suggests that current retrieval-oriented metrics may overstate the benefits of advanced retrieval techniques. These data-driven insights aim to guide the development of production-ready intelligent RAG systems.

arxiv arXiv cs.CL · 15h ago

BITEMBED: Extreme Low-Bit Framework for LLM-Based Text Embeddings

The paper introduces BITEMBED, an extreme low-bit framework designed to address the high deployment costs of LLM-based text embedders by targeting both encoding efficiency and vector storage. The method converts pretrained LLM backbones into BitNet-style encoders featuring ternary weights, quantized activations, and lightweight normalization refinement. To adapt these models for representation learning, BITEMBED employs continual contrastive pre-training followed by supervised contrastive fine-tuning. This fine-tuning process utilizes similarity-distribution distillation and attention-relation distillation from a full-precision teacher model. Beyond backbone quantization, the framework trains output embeddings to support multiple storage precisions, allowing for flexible trade-offs between performance and storage costs. Experiments on the MMTEB benchmark using Qwen3-0.6B and Gemma3-270M demonstrate that BITEMBED performs largely comparably to full-precision teacher embedders.

arxiv arXiv cs.CL · 16h ago

Space-Efficient Language Generation in the Limit

This study initiates a resource-aware theory of language generation in the limit under space efficiency constraints. A learner observes an adversarial positive stream from a target language K and must output a hallucination-free hypothesis L while omitting at most Δ strings. The research focuses on DFAs with s states over an alphabet of size k as the hypothesis class for memory-bounded learners. In the exponential-space regime, the authors prove that a learner can exactly identify the target language K. Under stricter memory budgets, they present a streaming algorithm using poly(s,k) space that converges to a hypothesis with a generation gap of Δ= O(k^{2s-2}). This learned hypothesis captures every string in K of length at least 2s-1. The results are complemented by a near-matching lower bound derived from communication complexity, showing that achieving Δ≤ k^{(1-ε)s} requires k^{Ω(εs)} memory. These findings reveal a sharp transition between polynomial-space generation and exponential-space exact identification.

arxiv arXiv cs.CL · 16h ago

SARA: Unlocking Multilingual Knowledge in Mixture-of-Experts via Semantically Anchored Routing Alignment

Sparse Mixture-of-Experts (MoE) architectures often struggle with low-resource languages due to cross-lingual routing divergence that limits expert sharing. To address this, researchers propose SARA, a framework that transfers specialized capabilities from high-resource anchor languages to low-resource ones. SARA aligns the internal routing distributions of MoE layers using a symmetric Jensen-Shannon divergence constraint rather than operating on output logits. This approach encourages mechanistic consistency in expert selection across different languages. The authors evaluated the method on two large language models across five low-resource languages and three benchmarks. Results show SARA outperforms standard instruction tuning, achieving gains of +0.8% on Qwen3-30B-A3B and +1.2% on Phi-3.5-MoE-instruct for Global-MMLU. These findings demonstrate that SARA effectively addresses performance bottlenecks in low-resource contexts.

arxiv arXiv cs.LG · 16h ago

Select-to-Act: Hierarchical Reinforcement Learning via Adaptive Language Guidance

The paper introduces HRLLI, a hierarchical reinforcement learning framework designed to improve sample efficiency by leveraging natural-language instructions. It addresses the limitation of existing approaches that treat instructions as static inputs, failing to account for their stage-dependent relevance in complex environments. The proposed method decomposes instructions into piecewise guidance elements that become relevant at different interaction stages. A novel Select-to-Act paradigm is formulated where a high-level semantic policy acts as a selector for the most relevant instruction piece based on the current state. This selected guidance conditions a low-level policy that executes environment actions, with both policies learned simultaneously to maximize augmented expected returns. Experiments on the RTFM benchmark demonstrate that HRLLI consistently outperforms strong instruction-conditioned RL baselines. The results confirm that explicitly modeling adaptive instruction selection significantly enhances reinforcement learning effectiveness.