All articles
arxiv arXiv cs.CL · 4h ago

Probing Self-Supervised Speech Representations on Mandarin Sub-dialects via Unsupervised Articulatory Analysis

This study investigates how internal phonetic representations in self-supervised speech models behave under fine-grained dialect variation, addressing the limitations of existing probing studies that rely on curated corpora. The authors present a case study using an entirely unlabeled probing pipeline for Mandarin sub-dialects. Phone sequences are generated via a language-agnostic universal phone recognizer and mapped to articulatory feature vectors, enabling frame-level probing without manual annotation. Results reveal structured patterns in articulatory feature decodability across different Mandarin dialects. Acoustically salient features like labiality and stridency remain comparatively stable, while those associated with finer spectral distinctions show larger dialect-dependent variation. This variation is driven primarily by elevated decodability for Beijing speech relative to other sub-dialects. Layer-wise analyses demonstrate distinct representational dynamics for these feature groups, suggesting uneven dialect sensitivity across articulatory dimensions.

arxiv arXiv cs.CL · 4h ago

Fully Differentiable Neural Forced Alignment via Soft Dynamic Programming

The authors propose an end-to-end, fully differentiable neural architecture designed specifically for phoneme alignment to address the stagnation in this field compared to ASR advancements. The model features an encoder with two complementary branches dedicated to phoneme identity verification and boundary detection. A decoder implemented as a trainable module based on differentiable soft dynamic programming produces the final alignment decisions. The entire system is optimized using a novel contrastive loss that encourages clear separation between steady-state phoneme regions and transition boundaries. Experimental results show the approach outperforms current state-of-the-art methods on hand-annotated English benchmarks. Additionally, the model demonstrates strong word-level generalization capabilities and effective performance on unseen languages.

arxiv arXiv cs.CL · 4h ago

Fine-Tuned PEGASUS Achieves State-of-the-Art Performance on XL-Sum English Corpus

This paper presents a method for optimizing abstractive text summarization by fine-tuning the PEGASUS model on the XL-Sum English corpus. The objective is to surpass the performance of the baseline mT5 model in generating concise summaries that capture salient ideas without merely extracting sentences. The generated summaries are evaluated using the ROUGE metric, which compares auto-generated outputs against human-created references. The study claims that the fine-tuned PEGASUS model achieves state-of-the-art results on this specific dataset. Quantitative analysis reveals a 4.04% improvement in the ROUGE-1 score compared to the baseline. Additionally, the model demonstrates a significant 15.25% increase in the ROUGE-2 score. Finally, there is a reported 3.39% improvement in the ROUGE-L score, confirming the effectiveness of the fine-tuning approach.

arxiv arXiv cs.CL · 4h ago

Red Teaming Framework Uncovers LLM Faithfulness Vulnerabilities via Multi-Role Architecture

This paper introduces a red teaming framework designed to systematically uncover vulnerabilities in large language model outputs through a multi-role architecture. The system utilizes target, attacker, and jury models to generate adversarial prompts and rigorously evaluate response accuracy and consistency. In a case study on faithfulness evaluation, exploitative adversarial prompts increased the attack success rate by up to 7.9% in question-answering tasks. The research demonstrates that architectural design choices typically outweigh parameter scaling in determining model safety and identifies how structural constraints shape vulnerability patterns. The framework shows adaptability across diverse evaluation tasks, ranging from English question-answering to Arabic summarization. However, the approach faces challenges in fully automating adversarial prompt generation across different languages. Additionally, experiments reveal limitations in detecting subtle forms of unfaithfulness that do not manifest as explicit factual contradictions.

arxiv arXiv cs.CL · 4h ago

Calibration and Adversarial Robustness of Automated ASR Scoring

This study evaluates the reliability of automated judges used to measure attack success rates in LLM jailbreaks by comparing them against human majority votes. Using 596 human-labeled completions from HarmBench, the authors find that dedicated safety classifiers over-flag with high recall but lower precision, while LLM-as-judges exhibit erratic recall ranging from 0.06 to 0.65. These discrepancies cause significant variability in reported ASR depending on which judge family is employed. The research also highlights sharp differences in robustness, showing that benign framing wrappers can flip LLM-judge decisions between 57% and 100% of the time. In contrast, dedicated classifiers resist such surface attacks but remain vulnerable to white-box GCG attacks, which flipped 70% of confident true positives despite a small optimization budget. A two-annotator audit confirmed that these adversarial flips preserved the underlying harmful content. Consequently, many current ASR metrics are deemed unreliable under deliberate pressure or average conditions. The authors recommend reporting judge precision and recall on human-labeled data and including adversarial checks in future research.

arxiv arXiv cs.CL · 5h ago

STC Improves Arabic Customer Service via MARBERT Sentiment Analysis

Saudi Telecom Company (STC) aims to enhance user satisfaction by leveraging Twitter feedback for sentiment analysis. The study addresses the gap in Arabic Natural Language Processing by training the MARBERT model on a specific dataset of 24,513 tweets. This collection includes 1,437 positive, 13,828 negative, and 5,694 neutral tweets, alongside 1,221 sarcastic and 2,297 indeterminate entries. The primary objective is to analyze these sentiments to improve STC's customer service responsiveness. Performance was evaluated using f1-score, precision, and recall metrics to ensure robust detection of spam and sentiment. Results indicate that the proposed scheme offers promising accuracy compared to existing techniques in the literature.

arxiv arXiv cs.CL · 5h ago

Behavioral Drivers of Rating-Sentiment Incongruence in Sri Lankan Tourism Reviews

This study investigates the incongruence between star ratings and written review sentiments within Sri Lankan tourism attraction reviews. Analyzing a dataset of 16,156 reviews from 2010 to 2023, researchers employed a transformer-based pipeline to derive textual sentiment independently of assigned ratings. The analysis reveals that 18.6% of reviews exhibit incongruence, primarily driven by Conservative Rater and Obligatory 5-Star behaviors. These mismatches vary across venue types, with museums demonstrating the highest rates of divergence. Statistical tests, logistic regression, Random Forest, and SHAP analysis identify venue type, reviewer expertise, review length, and temporal factors as key contributors to this phenomenon. The findings demonstrate that star ratings are not interchangeable with textual sentiment and require validation before being used as ground-truth labels in NLP tasks.

arxiv arXiv cs.CL · 5h ago

Cliff Tokens: Identifying Single-Token Failure Triggers in LLM Mathematical Reasoning

Researchers introduce the concept of cliff tokens to identify specific single-token failure triggers in large language models during mathematical reasoning tasks. Unlike prior work that analyzes failures at step or sentence levels, this method pinpoints the exact token where potential drops significantly using an adaptive threshold based on a z-test. The study evaluates seven models across three benchmarks: GSM1K, MATH500, and AIME 2025. Deleting the first cliff token and resampling allows recovery of pass@64 to 1.0, whereas keeping it limits recovery between 0.71 and 1.00. The authors propose a taxonomy classifying cliffs as deterministic, uncertain, or sampled-off based on greedy choice and token entropy. This classification generalizes across different model scales and exhibits distinct probabilistic characteristics for each type. Furthermore, the team validates this taxonomy through single-token preference optimization known as Cliff-DPO. Trained on GSM8K, Cliff-DPO improves accuracy by up to +6.6 across benchmarks. Optimization proves effective for uncertain and sampled-off cliffs but yields no improvement for deterministic ones.

arxiv arXiv cs.CL · 5h ago

SWE-Pro Benchmark Reveals Significant Gap Between LLMs and Expert Software Optimization

The SWE-Pro benchmark addresses the lack of realistic evaluation frameworks for software performance optimization by introducing a repository-level dataset derived from 102 expert-written optimizations. Unlike previous benchmarks that oversimplify tasks, SWE-Pro pairs each task with parameterized tests to evaluate runtime, peak memory, and Time-Weighted Memory Usage under noise-aware conditions. The study reveals that current Large Language Models struggle significantly with these complex requirements, showing negligible runtime gains and nearly non-existent memory optimizations. In sharp contrast, expert implementations achieved an aggregate speedup of 15.5x and a peak memory reduction of 171.3x across the benchmark tasks. Expert-written improvements were observed in 91.2% of tasks for runtime and 65.7% for peak memory. These findings expose a substantial gap between current LLM capabilities and the demands of expert-level engineering.

arxiv arXiv cs.CL · 5h ago

Security and Privacy in Retrieval-Augmented Generation: Architectures, Threats, Defenses, and Future Directions

This survey examines the security and privacy challenges inherent in Retrieval-Augmented Generation (RAG) systems across centralized, on-device, federated, and hybrid paradigms. It presents a unified taxonomy of threat surfaces that span retrieval, context construction, and generation stages. The analysis covers specific attack classes including membership inference, index inference, poisoning, gradient leakage, and collusion. Sensitive information risks are identified within retrieval indices, query logs, context construction, and federated updates. Adversarial manipulation of knowledge bases is highlighted as a key factor undermining trust in generated outputs. The paper reviews architectural, algorithmic, and cryptographic defenses while addressing privacy-utility trade-offs. Finally, it outlines open research challenges for building trustworthy and resilient RAG systems.

arxiv arXiv cs.CL · 5h ago

SFL-MTSC: Leveraging Semantic Frame-Level Multi-Task Self-Consistency for Robust Multi-Intent Spoken Language Understanding

Prompt-based spoken language understanding with large language models often suffers from inconsistent intent-slot structures due to decoding stochasticity, particularly in multi-intent scenarios. To address this, researchers propose Semantic Frame-Level Multi-Task Self-Consistency (SFL-MTSC), a novel structured aggregation framework operating at the semantic frame level. Instead of relying on output-level majority voting, SFL-MTSC decomposes predictions into intent-specific frames and applies domain-intent grouping alongside slot-level clustering. The framework evaluates cluster reliability using path support scoring to determine which frames are trustworthy. Reliable frames are retained and re-integrated to form the final prediction, ensuring greater structural consistency. Zero-shot experiments on the MAC-SLU benchmark dataset demonstrate improved slot F1 scores and overall accuracy compared to single-path inference. Intent accuracy remains largely stable across most settings while achieving these gains in slot-level performance.

arxiv arXiv cs.CL · 5h ago

BiPACE: Bisimulation-Guided Policy Optimization with Action Counterfactual Estimation for LLM Agents

The authors identify a fundamental state-action credit mismatch in stepwise group-based RL for long-horizon LLM agents. Current estimators suffer from overly fine state partitioning and coarse action averaging, which violates equivalence assumptions for credit assignment. BiPACE is introduced as a drop-in advantage estimator that fixes these issues without adding critics or extra rollouts. It clusters steps by cosine distance in the actor's hidden-state geometry to reduce singleton groups and recenters returns using action-conditioned peer baselines. On ALFWorld with Qwen2.5-7B, BiPACE_Q raises validation success from 90.8 to 97.1±0.9, crossing the 95% threshold on every seed. It also improves performance on Qwen2.5-1.5B and achieves gains on WebShop and TextCraft over GRPO and GiGPO. The method incurs only 11.3% overhead of a single training-step wall time while changing the comparison unit to approximate behavioral equivalence.

arxiv arXiv cs.CL · 5h ago

Riazi-8B: An Urdu Large Language Model for Mathematical Reasoning

Recent large language models demonstrate strong mathematical reasoning, but these gains rely heavily on English-centric resources, leaving low-resource languages like Urdu with limited capabilities. To address this gap, researchers developed Riazi-8B, an Urdu model designed specifically for multi-step mathematical problem solving. The model was created through a two-step adaptation process involving continued pre-training on Urdu Wikipedia and supervised fine-tuning on Urdu Chain-of-Thought data derived from GSM8K. Evaluation of Riazi-8B was conducted on the MGSM-Urdu benchmark against existing Urdu instruction-tuned models. The results showed consistent improvements in answer correctness, reasoning quality, response completeness, and Urdu generation compared to baselines. These findings demonstrate that combining Urdu language adaptation with reasoning-focused fine-tuning effectively extends mathematical reasoning capabilities to low-resource languages.

arxiv arXiv cs.CL · 5h ago

Constraint Tax in Open-Weight LLMs: Tool Calling Suppression Under Structured Output Constraints

This study identifies a phenomenon called Tool Suppression, where open-weight language models cease invoking tools when JSON Schema constraints are simultaneously enabled. The authors observed this behavior in a production Agent system and reproduced it through controlled experiments across multiple model families. While tool execution and schema compliance function correctly when evaluated independently, they fail under joint deployment conditions. Analysis reveals that JSON Schema constraints are compiled into grammar-based token masks, rendering tool-call tokens unreachable during decoding. To interpret these findings, the paper proposes the Constraint Priority Inversion hypothesis, suggesting schema satisfaction dominates action selection under simultaneous constraints. The authors mitigate this issue by introducing Transparent Two-Pass Execution, an inference-time strategy that decouples tool execution from response generation. This approach restores tool invocation while preserving structured output guarantees without requiring model retraining. The research highlights that evaluating capabilities separately may overlook critical reliability issues in production systems.

arxiv arXiv cs.CL · 5h ago

REVERIEMEM: Perspective-Bounded Memory for Book-Based Role-Playing Agents

Recent large language model role-playing systems often fail in long-narrative contexts due to factual overreach and stylistic monotony. Factual overreach occurs when characters access information outside their narrative perspective, while stylistic monotony flattens character voices through static profile descriptions. To address these issues, the authors propose REVERIEMEM, a three-layer memory architecture designed for book-based character agents. This system utilizes an episodic layer for first-person scene memories, a semantic layer for visibility-tagged facts, and a personality layer for situation-dependent behavioral patterns. The researchers also introduce KBF-QA, a benchmark consisting of 4,386 questions across eight novels to test knowledge boundaries. Experimental results show that REVERIEMEM improves Knowledge Boundary Fidelity by 34.6 percentage points compared to prior methods. Additionally, the model achieves approximately a 79% win rate on BOOKWORLD's five-dimension pairwise narrative protocol. These findings suggest that perspective-bounded memory effectively enhances both factual accuracy and character-grounded narrative generation.

arxiv arXiv cs.CL · 6h ago

MedGuards: Multi-Agent System for Reliable Medical Error Detection and Correction

The authors propose MedGuards, a medical safety guardrail framework designed to detect and correct errors in text generated by Large Language Models. This system treats error handling as a multi-agent in-context learning task where specialized agents separately perform detection, localization, and correction. A confidence-guided arbitration mechanism resolves disagreements among agents using reasoning traces and confidence scores without requiring additional model training. The study introduces the Keyword-Prioritized Correction Score (KPCS), a new metric that evaluates the accuracy of critical keywords within reference text. Experiments conducted across four multilingual medical datasets of clinical notes demonstrate significant improvements in performance metrics. These results highlight enhanced interpretability, robustness, and adaptability for safer LLM deployment in healthcare. The code for the MedErrBench benchmark is publicly available on GitHub.

github llama.cpp · 6h ago

llama.cpp b9786 Release Adds OpenCL Non-Contiguous Row Support

The llama.cpp project has released version b9786, introducing support for non-contiguous rows in normalization via OpenCL. This update is part of the ongoing development by the ggml-org team to enhance hardware compatibility and performance across various platforms. The release provides binaries for macOS Apple Silicon, Intel Macs, and iOS XCFrameworks. Linux users can access builds for Ubuntu x64, arm64, and s390x architectures using CPU, Vulkan, ROCm 7.2, OpenVINO, and SYCL backends. Android support is available for arm64 CPU devices, while Windows offers extensive options including CPU, CUDA 12 and 13, Vulkan, OpenVINO, SYCL, and HIP. The release also lists disabled builds for KleidiAI on macOS and openEuler platforms.

arxiv arXiv cs.CL · 6h ago

Framework Evaluates When GraphRAG and Agentic RAG Are Needed

The authors introduce a framework for evaluating and comparing regular, GraphRAG, Modular, and Agentic Retrieval-Augmented Generation (RAG) on semi-structured knowledge bases. They implement nine standardized scenarios spanning simple document retrieval to complex hybrid text-graph integration and agentic multi-step planning. A novel context engineering method is presented to address memory overflow issues in advanced RAG variants through new representations and agentic loop design. This optimization achieves a 19% to 53% reduction in token usage while efficiently managing retrievals. Further analysis reveals a retrieval-generation gap where expanded retrieval does not proportionally improve generation quality. The study suggests that current retrieval-oriented metrics may overstate the benefits of advanced retrieval techniques. These data-driven insights aim to guide the development of production-ready intelligent RAG systems.

arxiv arXiv cs.CL · 6h ago

BITEMBED: Extreme Low-Bit Framework for LLM-Based Text Embeddings

The paper introduces BITEMBED, an extreme low-bit framework designed to address the high deployment costs of LLM-based text embedders by targeting both encoding efficiency and vector storage. The method converts pretrained LLM backbones into BitNet-style encoders featuring ternary weights, quantized activations, and lightweight normalization refinement. To adapt these models for representation learning, BITEMBED employs continual contrastive pre-training followed by supervised contrastive fine-tuning. This fine-tuning process utilizes similarity-distribution distillation and attention-relation distillation from a full-precision teacher model. Beyond backbone quantization, the framework trains output embeddings to support multiple storage precisions, allowing for flexible trade-offs between performance and storage costs. Experiments on the MMTEB benchmark using Qwen3-0.6B and Gemma3-270M demonstrate that BITEMBED performs largely comparably to full-precision teacher embedders.

arxiv arXiv cs.CL · 6h ago

TRACE: Lightweight Detection of Corpus Poisoning in RAG via Token Influence Attribution

Retrieval-Augmented Generation systems face significant risks from corpus poisoning attacks that manipulate outputs through malicious documents. Existing detection methods often require auxiliary classifiers or additional LLM verification, which introduces substantial computational overhead. To address this, researchers introduced TRACE, a lightweight framework that identifies poisoning by tracing answer-related tokens via influence attribution. The system first discovers recurrent high-influence keywords across retrieved documents to flag potential threats. It then performs secondary verification to confirm the specific influence of these tokens on model predictions. Experiments conducted on three QA benchmarks and six LLMs demonstrate strong detection performance for the framework. Additionally, TRACE successfully uncovers attacker-specified target answers during the verification process.