arxiv arXiv cs.CL · 8h ago

Harness Design and Post-Training in LLM Agents

The article examines how tool harness design impacts the post-training of large language model agents. It argues that while agents are routinely post-trained, the scaffolding determining tool exposure is often treated as a fixed detail. Existing algorithms typically assume static environments, ignoring shifts in tools and tasks during deployment. To address this gap, the authors extended ALFWorld to treat harness design as a controllable dimension. This extension supports evaluation under both task and tool environment shifts. The study systematically analyzes harness influence on post-training in in-distribution and out-of-distribution settings. Results show that harness-aware post-training improves performance and enables robust adaptation to new environments. Conversely, minimal design effort leads to drastic performance drops under strong environmental shifts.

arxiv arXiv cs.CL · 8h ago

Reclaim Evaluation Shows Lossy Memory Is Worse Than No Memory

A study demonstrates that a language-model memory containing incorrect conclusions is more harmful than no memory at all. When models retain stale values while dropping supporting work, they emit confident but wrong answers, whereas empty memories allow for abstention. This phenomenon, termed brittle memory, was observed across seven models where the direction of failure never reversed regardless of task or disposition. The researchers introduced reclaim evaluation to measure correctability by compressing interactions and testing if corrections recover ground truth without using a judge. Results indicate that correctability depends on whether the source information survives compression rather than on model capability. A source-first policy, which keeps recomputable sources and drops re-derivable conclusions, restored correctability significantly better than length-matched controls. In chained memory loops, dropped-source errors corrupt downstream steps irreparably, while the proposed fix maintains bounded performance horizons. The findings replicate across three deployed systems and real dialogue data, with a hand-built oracle reaching perfect accuracy.

arxiv arXiv cs.CL · 8h ago

The Generalization Spectrum: A Chromatographic Approach to Evaluating Learning Algorithms

Traditional evaluations reduce learning to a single aggregate score, obscuring how well knowledge from one example generalizes to others. The authors introduce the Generalization Spectrum, an evaluation framework that measures per-sample generalization by tracking performance across test variants with increasing transfer distance. These variants range from exact recall to implementation transfer across languages and context transfer under narrative reframing. The framework is instantiated on competitive programming using a selection-and-synthesis pipeline seeded with recent problems to mitigate contamination. Comparisons of canonical learning paradigms show that Reinforcement Learning converts memorization into near-transfer more efficiently than Supervised Fine-Tuning baselines. In-context learning exhibits strong but correspondence-dependent transfer capabilities in this context. Diagnostic profiles reveal that local gains do not necessarily expand the generalization radius for all methods. Specifically, abstractions and hints mainly lift local transfer, while Reference SFT preserves a stronger far-transfer tail than RFT. Furthermore, self-distillation or hint-assisted RL can reduce far transfer even when local transfer improves.

arxiv arXiv cs.CL · 8h ago

Probing Self-Supervised Speech Representations on Mandarin Sub-dialects via Unsupervised Articulatory Analysis

This study investigates how internal phonetic representations in self-supervised speech models behave under fine-grained dialect variation, addressing the limitations of existing probing studies that rely on curated corpora. The authors present a case study using an entirely unlabeled probing pipeline for Mandarin sub-dialects. Phone sequences are generated via a language-agnostic universal phone recognizer and mapped to articulatory feature vectors, enabling frame-level probing without manual annotation. Results reveal structured patterns in articulatory feature decodability across different Mandarin dialects. Acoustically salient features like labiality and stridency remain comparatively stable, while those associated with finer spectral distinctions show larger dialect-dependent variation. This variation is driven primarily by elevated decodability for Beijing speech relative to other sub-dialects. Layer-wise analyses demonstrate distinct representational dynamics for these feature groups, suggesting uneven dialect sensitivity across articulatory dimensions.

arxiv arXiv cs.CL · 8h ago

Fully Differentiable Neural Forced Alignment via Soft Dynamic Programming

The authors propose an end-to-end, fully differentiable neural architecture designed specifically for phoneme alignment to address the stagnation in this field compared to ASR advancements. The model features an encoder with two complementary branches dedicated to phoneme identity verification and boundary detection. A decoder implemented as a trainable module based on differentiable soft dynamic programming produces the final alignment decisions. The entire system is optimized using a novel contrastive loss that encourages clear separation between steady-state phoneme regions and transition boundaries. Experimental results show the approach outperforms current state-of-the-art methods on hand-annotated English benchmarks. Additionally, the model demonstrates strong word-level generalization capabilities and effective performance on unseen languages.

arxiv arXiv cs.CL · 8h ago

Fine-Tuned PEGASUS Achieves State-of-the-Art Performance on XL-Sum English Corpus

This paper presents a method for optimizing abstractive text summarization by fine-tuning the PEGASUS model on the XL-Sum English corpus. The objective is to surpass the performance of the baseline mT5 model in generating concise summaries that capture salient ideas without merely extracting sentences. The generated summaries are evaluated using the ROUGE metric, which compares auto-generated outputs against human-created references. The study claims that the fine-tuned PEGASUS model achieves state-of-the-art results on this specific dataset. Quantitative analysis reveals a 4.04% improvement in the ROUGE-1 score compared to the baseline. Additionally, the model demonstrates a significant 15.25% increase in the ROUGE-2 score. Finally, there is a reported 3.39% improvement in the ROUGE-L score, confirming the effectiveness of the fine-tuning approach.
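Since the reported gains are all ROUGE scores, a small sketch of what ROUGE-N measures may help. The function below is a simplified illustration and not the paper's evaluation code: it assumes whitespace tokenization and lowercasing, with no stemming, and the name `rouge_n` and the example sentences are hypothetical.

```python
from collections import Counter


def rouge_n(candidate: str, reference: str, n: int = 1) -> dict:
    """Simplified ROUGE-N: clipped n-gram overlap between two texts.

    Assumption: whitespace tokenization and lowercasing only (no stemming),
    so the numbers will differ from those of a full ROUGE implementation.
    """
    def ngrams(text: str) -> Counter:
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    generated = "the cat sat on the mat"   # made-up model output
    human = "the cat lay on the mat"       # made-up reference summary
    print("ROUGE-1:", rouge_n(generated, human, n=1))
    print("ROUGE-2:", rouge_n(generated, human, n=2))
```

ROUGE-2 counts matching word pairs rather than single words, so it is usually the lower and more sensitive of the two scores.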

arxiv arXiv cs.CL · 8h ago

Red Teaming Framework Uncovers LLM Faithfulness Vulnerabilities via Multi-Role Architecture

This paper introduces a red teaming framework designed to systematically uncover vulnerabilities in large language model outputs through a multi-role architecture. The system utilizes target, attacker, and jury models to generate adversarial prompts and rigorously evaluate response accuracy and consistency. In a case study on faithfulness evaluation, exploitative adversarial prompts increased the attack success rate by up to 7.9% in question-answering tasks. The research demonstrates that architectural design choices typically outweigh parameter scaling in determining model safety and identifies how structural constraints shape vulnerability patterns. The framework shows adaptability across diverse evaluation tasks, ranging from English question-answering to Arabic summarization. However, the approach faces challenges in fully automating adversarial prompt generation across different languages. Additionally, experiments reveal limitations in detecting subtle forms of unfaithfulness that do not manifest as explicit factual contradictions.

arxiv arXiv cs.CL · 8h ago

Calibration and Adversarial Robustness of Automated ASR Scoring

This study evaluates the reliability of automated judges used to measure attack success rate (ASR) in LLM jailbreaks by comparing them against human majority votes. Using 596 human-labeled completions from HarmBench, the authors find that dedicated safety classifiers over-flag, with high recall but lower precision, while LLM-as-judge models exhibit erratic recall ranging from 0.06 to 0.65. These discrepancies cause significant variability in reported ASR depending on which judge family is employed. The research also highlights sharp differences in robustness, showing that benign framing wrappers can flip LLM-judge decisions between 57% and 100% of the time. In contrast, dedicated classifiers resist such surface attacks but remain vulnerable to white-box GCG attacks, which flipped 70% of confident true positives despite a small optimization budget. A two-annotator audit confirmed that these adversarial flips preserved the underlying harmful content. Consequently, the authors consider many current ASR metrics unreliable, both under deliberate adversarial pressure and under average conditions. They recommend reporting judge precision and recall on human-labeled data and including adversarial checks in future research.
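The recommendation to report judge precision and recall against human labels can be illustrated with a minimal sketch. The labels below are invented toy data, not HarmBench results, and `judge_metrics` is a hypothetical helper name; the point is only to show how an over-flagging judge and a low-recall judge report very different ASR values on the same completions.

```python
def judge_metrics(human: list, judge: list) -> dict:
    """Compare an automated judge with human majority labels.

    Each entry is 1 if the completion was labeled a successful attack, else 0.
    """
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and of equal length")
    tp = sum(1 for h, j in zip(human, judge) if h and j)
    fp = sum(1 for h, j in zip(human, judge) if not h and j)
    fn = sum(1 for h, j in zip(human, judge) if h and not j)
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "judge_asr": sum(judge) / len(judge),   # ASR the judge would report
        "human_asr": sum(human) / len(human),   # ASR according to humans
    }


if __name__ == "__main__":
    # Toy labels, invented for illustration only.
    human = [1, 1, 0, 0, 1, 0, 0, 0]
    over_flagging = [1, 1, 1, 1, 1, 0, 0, 0]   # high recall, lower precision
    low_recall = [1, 0, 0, 0, 0, 0, 0, 0]      # misses most true positives
    print(judge_metrics(human, over_flagging))
    print(judge_metrics(human, low_recall))
```

On this toy data the same set of completions yields a judge-reported ASR of 0.625 from the first judge and 0.125 from the second, against a human-labeled ASR of 0.375.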

arxiv arXiv cs.CL · 9h ago

STC Improves Arabic Customer Service via MARBERT Sentiment Analysis

Saudi Telecom Company (STC) aims to enhance user satisfaction by leveraging Twitter feedback for sentiment analysis. The study addresses the gap in Arabic Natural Language Processing by training the MARBERT model on a specific dataset of 24,513 tweets. This collection includes 1,437 positive, 13,828 negative, and 5,694 neutral tweets, alongside 1,221 sarcastic and 2,297 indeterminate entries. The primary objective is to analyze these sentiments to improve STC's customer service responsiveness. Performance was evaluated using F1-score, precision, and recall to ensure robust detection of spam and sentiment. Results indicate that the proposed scheme offers promising accuracy compared to existing techniques in the literature.

arxiv arXiv cs.CL · 9h ago

Behavioral Drivers of Rating-Sentiment Incongruence in Sri Lankan Tourism Reviews

This study investigates the incongruence between star ratings and written review sentiments within Sri Lankan tourism attraction reviews. Analyzing a dataset of 16,156 reviews from 2010 to 2023, researchers employed a transformer-based pipeline to derive textual sentiment independently of assigned ratings. The analysis reveals that 18.6% of reviews exhibit incongruence, primarily driven by Conservative Rater and Obligatory 5-Star behaviors. These mismatches vary across venue types, with museums demonstrating the highest rates of divergence. Statistical tests, logistic regression, Random Forest, and SHAP analysis identify venue type, reviewer expertise, review length, and temporal factors as key contributors to this phenomenon. The findings demonstrate that star ratings are not interchangeable with textual sentiment and require validation before being used as ground-truth labels in NLP tasks.

arxiv arXiv cs.CL · 9h ago

Cliff Tokens: Identifying Single-Token Failure Triggers in LLM Mathematical Reasoning

Researchers introduce the concept of cliff tokens to identify specific single-token failure triggers in large language models during mathematical reasoning tasks. Unlike prior work that analyzes failures at the step or sentence level, this method pinpoints the exact token where the potential drops significantly, using an adaptive threshold based on a z-test. The study evaluates seven models across three benchmarks: GSM1K, MATH500, and AIME 2025. Deleting the first cliff token and resampling allows pass@64 to recover to 1.0, whereas keeping it limits recovery to between 0.71 and 1.00. The authors propose a taxonomy classifying cliffs as deterministic, uncertain, or sampled-off based on greedy choice and token entropy. This classification generalizes across different model scales and exhibits distinct probabilistic characteristics for each type. Furthermore, the team validates this taxonomy through a single-token preference optimization method called Cliff-DPO. Trained on GSM8K, Cliff-DPO yields accuracy gains of up to +6.6 across benchmarks. Optimization proves effective for uncertain and sampled-off cliffs but yields no improvement for deterministic ones.

arxiv arXiv cs.CL · 9h ago

SWE-Pro Benchmark Reveals Significant Gap Between LLMs and Expert Software Optimization

The SWE-Pro benchmark addresses the lack of realistic evaluation frameworks for software performance optimization by introducing a repository-level dataset derived from 102 expert-written optimizations. Unlike previous benchmarks that oversimplify tasks, SWE-Pro pairs each task with parameterized tests to evaluate runtime, peak memory, and Time-Weighted Memory Usage under noise-aware conditions. The study reveals that current Large Language Models struggle significantly with these complex requirements, showing negligible runtime gains and nearly non-existent memory optimizations. In sharp contrast, expert implementations achieved an aggregate speedup of 15.5x and a peak memory reduction of 171.3x across the benchmark tasks. Expert-written improvements were observed in 91.2% of tasks for runtime and 65.7% for peak memory. These findings expose a substantial gap between current LLM capabilities and the demands of expert-level engineering.

arxiv arXiv cs.CL · 9h ago

Security and Privacy in Retrieval-Augmented Generation: Architectures, Threats, Defenses, and Future Directions

This survey examines the security and privacy challenges inherent in Retrieval-Augmented Generation (RAG) systems across centralized, on-device, federated, and hybrid paradigms. It presents a unified taxonomy of threat surfaces that span retrieval, context construction, and generation stages. The analysis covers specific attack classes including membership inference, index inference, poisoning, gradient leakage, and collusion. Sensitive information risks are identified within retrieval indices, query logs, context construction, and federated updates. Adversarial manipulation of knowledge bases is highlighted as a key factor undermining trust in generated outputs. The paper reviews architectural, algorithmic, and cryptographic defenses while addressing privacy-utility trade-offs. Finally, it outlines open research challenges for building trustworthy and resilient RAG systems.

arxiv arXiv cs.CL · 9h ago

SFL-MTSC: Leveraging Semantic Frame-Level Multi-Task Self-Consistency for Robust Multi-Intent Spoken Language Understanding

Prompt-based spoken language understanding with large language models often suffers from inconsistent intent-slot structures due to decoding stochasticity, particularly in multi-intent scenarios. To address this, researchers propose Semantic Frame-Level Multi-Task Self-Consistency (SFL-MTSC), a novel structured aggregation framework operating at the semantic frame level. Instead of relying on output-level majority voting, SFL-MTSC decomposes predictions into intent-specific frames and applies domain-intent grouping alongside slot-level clustering. The framework evaluates cluster reliability using path support scoring to determine which frames are trustworthy. Reliable frames are retained and re-integrated to form the final prediction, ensuring greater structural consistency. Zero-shot experiments on the MAC-SLU benchmark dataset demonstrate improved slot F1 scores and overall accuracy compared to single-path inference. Intent accuracy remains largely stable across most settings while achieving these gains in slot-level performance.
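The frame-level aggregation described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' algorithm: path support is approximated as the fraction of sampled paths that contain a (domain, intent) frame, slot-level clustering is reduced to exact-match voting on slot values, and the function name, the `min_support` threshold, and the example frames are all hypothetical.

```python
from collections import Counter, defaultdict


def aggregate_frames(paths: list, min_support: float = 0.5) -> list:
    """Merge several sampled predictions at the semantic-frame level.

    Each path is a list of frames: {"domain": str, "intent": str, "slots": dict}.
    Assumption: support = share of paths containing the frame or slot value.
    """
    n_paths = len(paths)
    groups = defaultdict(list)  # (domain, intent) -> slot dicts, one per path
    for frames in paths:
        seen = set()
        for frame in frames:
            key = (frame["domain"], frame["intent"])
            if key in seen:          # count each path once per group
                continue
            seen.add(key)
            groups[key].append(frame["slots"])

    merged = []
    for (domain, intent), slot_dicts in groups.items():
        support = len(slot_dicts) / n_paths
        if support < min_support:    # drop frames backed by too few paths
            continue
        votes = defaultdict(Counter)
        for slots in slot_dicts:
            for name, value in slots.items():
                votes[name][value] += 1
        slots = {}
        for name, counter in votes.items():
            value, count = counter.most_common(1)[0]
            if count / n_paths >= min_support:
                slots[name] = value
        merged.append({"domain": domain, "intent": intent,
                       "slots": slots, "support": support})
    return merged


if __name__ == "__main__":
    # Four made-up decoding paths for one multi-intent utterance.
    paths = [
        [{"domain": "music", "intent": "play", "slots": {"song": "hello"}},
         {"domain": "nav", "intent": "route", "slots": {"dest": "home"}}],
        [{"domain": "music", "intent": "play", "slots": {"song": "hello"}}],
        [{"domain": "music", "intent": "play", "slots": {"song": "halo"}},
         {"domain": "nav", "intent": "route", "slots": {"dest": "home"}}],
        [{"domain": "music", "intent": "play", "slots": {"song": "hello"}},
         {"domain": "weather", "intent": "query", "slots": {"city": "paris"}}],
    ]
    for frame in aggregate_frames(paths):
        print(frame)
```

Unlike output-level majority voting, no single complete output has to win here: the music frame and the navigation frame are each retained on their own support, even though only two of the four paths contain both.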

arxiv arXiv cs.CL · 9h ago

BiPACE: Bisimulation-Guided Policy Optimization with Action Counterfactual Estimation for LLM Agents

The authors identify a fundamental state-action credit mismatch in stepwise group-based RL for long-horizon LLM agents. Current estimators suffer from overly fine state partitioning and coarse action averaging, which violates the equivalence assumptions behind credit assignment. BiPACE is introduced as a drop-in advantage estimator that fixes these issues without adding critics or extra rollouts. It clusters steps by cosine distance in the actor's hidden-state geometry to reduce singleton groups and recenters returns using action-conditioned peer baselines. On ALFWorld with Qwen2.5-7B, BiPACE_Q raises validation success from 90.8 to 97.1±0.9, crossing the 95% threshold on every seed. It also improves performance on Qwen2.5-1.5B and achieves gains on WebShop and TextCraft over GRPO and GiGPO. The method adds only 11.3% to the wall time of a single training step while changing the comparison unit to approximate behavioral equivalence.

arxiv arXiv cs.CL · 9h ago

Riazi-8B: An Urdu Large Language Model for Mathematical Reasoning

Recent large language models demonstrate strong mathematical reasoning, but these gains rely heavily on English-centric resources, leaving low-resource languages like Urdu with limited capabilities. To address this gap, researchers developed Riazi-8B, an Urdu model designed specifically for multi-step mathematical problem solving. The model was created through a two-step adaptation process involving continued pre-training on Urdu Wikipedia and supervised fine-tuning on Urdu Chain-of-Thought data derived from GSM8K. Evaluation of Riazi-8B was conducted on the MGSM-Urdu benchmark against existing Urdu instruction-tuned models. The results showed consistent improvements in answer correctness, reasoning quality, response completeness, and Urdu generation compared to baselines. These findings demonstrate that combining Urdu language adaptation with reasoning-focused fine-tuning effectively extends mathematical reasoning capabilities to low-resource languages.

arxiv arXiv cs.CL · 9h ago

Constraint Tax in Open-Weight LLMs: Tool Calling Suppression Under Structured Output Constraints

This study identifies a phenomenon called Tool Suppression, where open-weight language models cease invoking tools when JSON Schema constraints are simultaneously enabled. The authors observed this behavior in a production Agent system and reproduced it through controlled experiments across multiple model families. While tool execution and schema compliance function correctly when evaluated independently, they fail under joint deployment conditions. Analysis reveals that JSON Schema constraints are compiled into grammar-based token masks, rendering tool-call tokens unreachable during decoding. To interpret these findings, the paper proposes the Constraint Priority Inversion hypothesis, suggesting schema satisfaction dominates action selection under simultaneous constraints. The authors mitigate this issue by introducing Transparent Two-Pass Execution, an inference-time strategy that decouples tool execution from response generation. This approach restores tool invocation while preserving structured output guarantees without requiring model retraining. The research highlights that evaluating capabilities separately may overlook critical reliability issues in production systems.
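The masking mechanism described above can be illustrated with a toy decoder. This is a minimal sketch under stated assumptions, not the paper's implementation: the five-token vocabulary, the logits, the rule that the schema grammar only permits `{` as the first token, and the `two_pass` helper are all hypothetical, and the two-pass function is only one plausible reading of "decoupling tool execution from response generation".

```python
import math

# Hypothetical toy vocabulary; a real tokenizer has tens of thousands of entries.
VOCAB = ["<tool_call>", "{", "}", '"answer"', "hello"]


def apply_mask(logits: list, allowed: set) -> list:
    """Grammar-style mask: tokens outside `allowed` get a logit of -inf."""
    return [x if tok in allowed else -math.inf for tok, x in zip(VOCAB, logits)]


def greedy(logits: list) -> str:
    """Pick the highest-scoring token."""
    return VOCAB[max(range(len(logits)), key=logits.__getitem__)]


def two_pass(first_logits: list, post_tool_logits: list, schema_allowed: set):
    """Assumed two-pass scheme: decide on tool use without the schema mask,
    then apply the mask only when generating the final structured response."""
    used_tool = greedy(first_logits) == "<tool_call>"
    final_logits = post_tool_logits if used_tool else first_logits
    return used_tool, greedy(apply_mask(final_logits, schema_allowed))


if __name__ == "__main__":
    logits = [5.0, 2.0, 0.1, 0.5, 1.0]   # the model prefers to call a tool
    schema_allowed = {"{"}                # schema output must start with "{"

    print(greedy(logits))                              # unconstrained choice
    print(greedy(apply_mask(logits, schema_allowed)))  # choice under the mask
    print(two_pass(logits, [0.0, 3.0, 0.1, 0.5, 1.0], schema_allowed))
```

In this toy setup the tool-call token has the highest logit yet can never be selected once the mask is applied, which is the sense in which a constraint makes the token unreachable regardless of the model's preference.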

arxiv arXiv cs.CL · 9h ago

REVERIEMEM: Perspective-Bounded Memory for Book-Based Role-Playing Agents

Recent large language model role-playing systems often fail in long-narrative contexts due to factual overreach and stylistic monotony. Factual overreach occurs when characters access information outside their narrative perspective, while stylistic monotony flattens character voices through static profile descriptions. To address these issues, the authors propose REVERIEMEM, a three-layer memory architecture designed for book-based character agents. This system utilizes an episodic layer for first-person scene memories, a semantic layer for visibility-tagged facts, and a personality layer for situation-dependent behavioral patterns. The researchers also introduce KBF-QA, a benchmark consisting of 4,386 questions across eight novels to test knowledge boundaries. Experimental results show that REVERIEMEM improves Knowledge Boundary Fidelity by 34.6 percentage points compared to prior methods. Additionally, the model achieves approximately a 79% win rate on BOOKWORLD's five-dimension pairwise narrative protocol. These findings suggest that perspective-bounded memory effectively enhances both factual accuracy and character-grounded narrative generation.

arxiv arXiv cs.CL · 10h ago

MedGuards: Multi-Agent System for Reliable Medical Error Detection and Correction

The authors propose MedGuards, a medical safety guardrail framework designed to detect and correct errors in text generated by Large Language Models. This system treats error handling as a multi-agent in-context learning task where specialized agents separately perform detection, localization, and correction. A confidence-guided arbitration mechanism resolves disagreements among agents using reasoning traces and confidence scores without requiring additional model training. The study introduces the Keyword-Prioritized Correction Score (KPCS), a new metric that evaluates the accuracy of critical keywords within reference text. Experiments conducted across four multilingual medical datasets of clinical notes demonstrate significant improvements in performance metrics. These results highlight enhanced interpretability, robustness, and adaptability for safer LLM deployment in healthcare. The code for the MedErrBench benchmark is publicly available on GitHub.

github llama.cpp · 10h ago

llama.cpp b9786 Release Adds OpenCL Non-Contiguous Row Support

The llama.cpp project has released version b9786, introducing support for non-contiguous rows in normalization via OpenCL. This update is part of the ongoing development by the ggml-org team to enhance hardware compatibility and performance across various platforms. The release provides binaries for macOS Apple Silicon, Intel Macs, and iOS XCFrameworks. Linux users can access builds for Ubuntu x64, arm64, and s390x architectures using CPU, Vulkan, ROCm 7.2, OpenVINO, and SYCL backends. Android support is available for arm64 CPU devices, while Windows offers extensive options including CPU, CUDA 12 and 13, Vulkan, OpenVINO, SYCL, and HIP. The release also lists disabled builds for KleidiAI on macOS and openEuler platforms.