arxiv arXiv cs.CL · 10h ago

Calibration and Adversarial Robustness of Automated ASR Scoring

This study evaluates the reliability of automated judges used to measure attack success rates (ASR) in LLM jailbreaks by comparing them against human majority votes. Using 596 human-labeled completions from HarmBench, the authors find that dedicated safety classifiers over-flag, with high recall but lower precision, while LLM-as-judge models exhibit erratic recall ranging from 0.06 to 0.65. These discrepancies cause significant variability in reported ASR depending on which judge family is employed. The research also highlights sharp differences in robustness, showing that benign framing wrappers can flip LLM-judge decisions between 57% and 100% of the time. In contrast, dedicated classifiers resist such surface attacks but remain vulnerable to white-box GCG attacks, which flipped 70% of confident true positives despite a small optimization budget. A two-annotator audit confirmed that these adversarial flips preserved the underlying harmful content. Consequently, the authors deem many current ASR metrics unreliable both under average conditions and under deliberate adversarial pressure. They recommend reporting judge precision and recall on human-labeled data and including adversarial checks in future research.
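To make the precision/recall recommendation concrete, here is a minimal sketch of how a judge could be scored against human majority labels. The function name `judge_report` and the toy labels are illustrative assumptions, not the authors' code or data.

```python
def judge_report(human, judge):
    """Compare binary judge verdicts against human majority labels.

    human, judge: equal-length lists of bools (True = attack counted as successful).
    """
    if len(human) != len(judge):
        raise ValueError("label lists must have the same length")
    tp = sum(h and j for h, j in zip(human, judge))
    fp = sum((not h) and j for h, j in zip(human, judge))
    fn = sum(h and (not j) for h, j in zip(human, judge))
    n = len(human)
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "asr_human": sum(human) / n,  # ASR according to human votes
        "asr_judge": sum(judge) / n,  # ASR the judge would report
    }


# Toy labels (assumed for illustration only).
human = [True, True, True, False, False, False, False, False]
over_flagging = [True, True, True, True, True, False, False, False]
low_recall = [True, False, False, False, False, False, False, False]

print(judge_report(human, over_flagging))
print(judge_report(human, low_recall))
```

On these toy labels the over-flagging judge inflates the reported ASR and the low-recall judge deflates it, which is the kind of judge-dependent variability the summary describes.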

arxiv arXiv cs.CL · 10h ago

STC Improves Arabic Customer Service via MARBERT Sentiment Analysis

Saudi Telecom Company (STC) aims to enhance user satisfaction by leveraging Twitter feedback for sentiment analysis. The study addresses the gap in Arabic Natural Language Processing by training the MARBERT model on a specific dataset of 24,513 tweets. This collection includes 1,437 positive, 13,828 negative, and 5,694 neutral tweets, alongside 1,221 sarcastic and 2,297 indeterminate entries. The primary objective is to analyze these sentiments to improve STC's customer service responsiveness. Performance was evaluated using F1-score, precision, and recall to ensure robust detection of spam and sentiment. Results indicate that the proposed scheme offers promising accuracy compared to existing techniques in the literature.

arxiv arXiv cs.CL · 10h ago

Behavioral Drivers of Rating-Sentiment Incongruence in Sri Lankan Tourism Reviews

This study investigates the incongruence between star ratings and written review sentiments within Sri Lankan tourism attraction reviews. Analyzing a dataset of 16,156 reviews from 2010 to 2023, researchers employed a transformer-based pipeline to derive textual sentiment independently of assigned ratings. The analysis reveals that 18.6% of reviews exhibit incongruence, primarily driven by Conservative Rater and Obligatory 5-Star behaviors. These mismatches vary across venue types, with museums demonstrating the highest rates of divergence. Statistical tests, logistic regression, Random Forest, and SHAP analysis identify venue type, reviewer expertise, review length, and temporal factors as key contributors to this phenomenon. The findings demonstrate that star ratings are not interchangeable with textual sentiment and require validation before being used as ground-truth labels in NLP tasks.

arxiv arXiv cs.CL · 10h ago

Cliff Tokens: Identifying Single-Token Failure Triggers in LLM Mathematical Reasoning

Researchers introduce the concept of cliff tokens to identify specific single-token failure triggers in large language models during mathematical reasoning tasks. Unlike prior work that analyzes failures at step or sentence levels, this method pinpoints the exact token where potential drops significantly, using an adaptive threshold based on a z-test. The study evaluates seven models across three benchmarks: GSM1K, MATH500, and AIME 2025. Deleting the first cliff token and resampling allows recovery of pass@64 to 1.0, whereas keeping it limits recovery to between 0.71 and 1.00. The authors propose a taxonomy classifying cliffs as deterministic, uncertain, or sampled-off based on greedy choice and token entropy. This classification generalizes across different model scales and exhibits distinct probabilistic characteristics for each type. Furthermore, the team validates this taxonomy through a single-token preference optimization method called Cliff-DPO. Trained on GSM8K, Cliff-DPO improves accuracy by up to +6.6 across benchmarks. Optimization proves effective for uncertain and sampled-off cliffs but yields no improvement for deterministic ones.
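One way to picture a z-test-based cliff detector is sketched below. It assumes that the "potential" at each token is a success rate estimated from a fixed number of rollouts, and it uses a one-sided two-proportion z-test between consecutive tokens; both choices, the function names, and the critical value 1.645 are assumptions for illustration, not the paper's exact procedure.

```python
import math


def two_proportion_z(p_before, p_after, n):
    """One-sided z statistic for a drop from p_before to p_after,
    each estimated from n rollouts (pooled two-proportion test)."""
    pooled = (p_before + p_after) / 2
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    if se == 0:
        return 0.0
    return (p_before - p_after) / se


def first_cliff(potentials, n_rollouts, z_crit=1.645):
    """Return the index of the first token whose drop in potential is
    statistically significant, or None if there is no such token.

    potentials[i] is the estimated success rate after emitting token i.
    """
    for i in range(1, len(potentials)):
        z = two_proportion_z(potentials[i - 1], potentials[i], n_rollouts)
        if z > z_crit:
            return i
    return None


# Toy potentials (assumed): a sharp drop at token index 3.
print(first_cliff([0.90, 0.88, 0.85, 0.30, 0.25], n_rollouts=64))
```

Because the standard error depends on the local success rate and the rollout count, the effective threshold adapts to each position rather than being a fixed drop size.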

arxiv arXiv cs.CL · 10h ago

SWE-Pro Benchmark Reveals Significant Gap Between LLMs and Expert Software Optimization

The SWE-Pro benchmark addresses the lack of realistic evaluation frameworks for software performance optimization by introducing a repository-level dataset derived from 102 expert-written optimizations. Unlike previous benchmarks that oversimplify tasks, SWE-Pro pairs each task with parameterized tests to evaluate runtime, peak memory, and Time-Weighted Memory Usage under noise-aware conditions. The study reveals that current Large Language Models struggle significantly with these complex requirements, showing negligible runtime gains and nearly non-existent memory optimizations. In sharp contrast, expert implementations achieved an aggregate speedup of 15.5x and a peak memory reduction of 171.3x across the benchmark tasks. Expert-written improvements were observed in 91.2% of tasks for runtime and 65.7% for peak memory. These findings expose a substantial gap between current LLM capabilities and the demands of expert-level engineering.

arxiv arXiv cs.CL · 10h ago

Security and Privacy in Retrieval-Augmented Generation: Architectures, Threats, Defenses, and Future Directions

This survey examines the security and privacy challenges inherent in Retrieval-Augmented Generation (RAG) systems across centralized, on-device, federated, and hybrid paradigms. It presents a unified taxonomy of threat surfaces that span retrieval, context construction, and generation stages. The analysis covers specific attack classes including membership inference, index inference, poisoning, gradient leakage, and collusion. Sensitive information risks are identified within retrieval indices, query logs, context construction, and federated updates. Adversarial manipulation of knowledge bases is highlighted as a key factor undermining trust in generated outputs. The paper reviews architectural, algorithmic, and cryptographic defenses while addressing privacy-utility trade-offs. Finally, it outlines open research challenges for building trustworthy and resilient RAG systems.

arxiv arXiv cs.CL · 10h ago

SFL-MTSC: Leveraging Semantic Frame-Level Multi-Task Self-Consistency for Robust Multi-Intent Spoken Language Understanding

Prompt-based spoken language understanding with large language models often suffers from inconsistent intent-slot structures due to decoding stochasticity, particularly in multi-intent scenarios. To address this, researchers propose Semantic Frame-Level Multi-Task Self-Consistency (SFL-MTSC), a novel structured aggregation framework operating at the semantic frame level. Instead of relying on output-level majority voting, SFL-MTSC decomposes predictions into intent-specific frames and applies domain-intent grouping alongside slot-level clustering. The framework evaluates cluster reliability using path support scoring to determine which frames are trustworthy. Reliable frames are retained and re-integrated to form the final prediction, ensuring greater structural consistency. Zero-shot experiments on the MAC-SLU benchmark dataset demonstrate improved slot F1 scores and overall accuracy compared to single-path inference. Intent accuracy remains largely stable across most settings while achieving these gains in slot-level performance.
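The frame-level aggregation idea can be sketched as follows. This is a simplified illustration: it assumes each sampled path is a list of `(domain, intent, slots)` frames, clusters slot sets by exact match, and approximates path support scoring as the fraction of paths that contain a cluster. The name `aggregate_frames` and the `min_support` threshold are hypothetical.

```python
from collections import Counter, defaultdict


def aggregate_frames(paths, min_support=0.5):
    """paths: list of sampled predictions; each is a list of
    (domain, intent, slots) tuples where slots maps slot name -> value.
    Returns the retained frames, sorted by (domain, intent)."""
    votes = defaultdict(Counter)
    for frames in paths:
        seen = set()
        for domain, intent, slots in frames:
            key = (domain, intent)
            cluster = frozenset(slots.items())
            if (key, cluster) not in seen:  # count each path once per cluster
                seen.add((key, cluster))
                votes[key][cluster] += 1
    kept = []
    for (domain, intent), clusters in votes.items():
        cluster, count = clusters.most_common(1)[0]
        if count / len(paths) >= min_support:
            kept.append((domain, intent, dict(sorted(cluster))))
    return sorted(kept, key=lambda f: (f[0], f[1]))


# Toy sampled paths (assumed data).
paths = [
    [("music", "play", {"artist": "Adele"}), ("nav", "route", {"dest": "home"})],
    [("music", "play", {"artist": "Adele"})],
    [("music", "play", {"artist": "Adel"}), ("nav", "route", {"dest": "home"})],
]
print(aggregate_frames(paths))
```

Unlike output-level majority voting, no single path here matches the final result exactly, yet each retained frame is supported by two of the three paths.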

arxiv arXiv cs.CL · 11h ago

BiPACE: Bisimulation-Guided Policy Optimization with Action Counterfactual Estimation for LLM Agents

The authors identify a fundamental state-action credit mismatch in stepwise group-based RL for long-horizon LLM agents. Current estimators suffer from overly fine state partitioning and coarse action averaging, which violates equivalence assumptions for credit assignment. BiPACE is introduced as a drop-in advantage estimator that fixes these issues without adding critics or extra rollouts. It clusters steps by cosine distance in the actor's hidden-state geometry to reduce singleton groups and recenters returns using action-conditioned peer baselines. On ALFWorld with Qwen2.5-7B, BiPACE_Q raises validation success from 90.8 to 97.1±0.9, crossing the 95% threshold on every seed. It also improves performance on Qwen2.5-1.5B and achieves gains on WebShop and TextCraft over GRPO and GiGPO. The method adds only 11.3% to the wall time of a single training step while changing the comparison unit to approximate behavioral equivalence.

arxiv arXiv cs.CL · 11h ago

Riazi-8B: An Urdu Large Language Model for Mathematical Reasoning

Recent large language models demonstrate strong mathematical reasoning, but these gains rely heavily on English-centric resources, leaving low-resource languages like Urdu with limited capabilities. To address this gap, researchers developed Riazi-8B, an Urdu model designed specifically for multi-step mathematical problem solving. The model was created through a two-step adaptation process involving continued pre-training on Urdu Wikipedia and supervised fine-tuning on Urdu Chain-of-Thought data derived from GSM8K. Evaluation of Riazi-8B was conducted on the MGSM-Urdu benchmark against existing Urdu instruction-tuned models. The results showed consistent improvements in answer correctness, reasoning quality, response completeness, and Urdu generation compared to baselines. These findings demonstrate that combining Urdu language adaptation with reasoning-focused fine-tuning effectively extends mathematical reasoning capabilities to low-resource languages.

arxiv arXiv cs.CL · 11h ago

Constraint Tax in Open-Weight LLMs: Tool Calling Suppression Under Structured Output Constraints

This study identifies a phenomenon called Tool Suppression, where open-weight language models cease invoking tools when JSON Schema constraints are simultaneously enabled. The authors observed this behavior in a production Agent system and reproduced it through controlled experiments across multiple model families. While tool execution and schema compliance function correctly when evaluated independently, they fail under joint deployment conditions. Analysis reveals that JSON Schema constraints are compiled into grammar-based token masks, rendering tool-call tokens unreachable during decoding. To interpret these findings, the paper proposes the Constraint Priority Inversion hypothesis, suggesting schema satisfaction dominates action selection under simultaneous constraints. The authors mitigate this issue by introducing Transparent Two-Pass Execution, an inference-time strategy that decouples tool execution from response generation. This approach restores tool invocation while preserving structured output guarantees without requiring model retraining. The research highlights that evaluating capabilities separately may overlook critical reliability issues in production systems.
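The masking mechanism can be illustrated with a toy decoder step. The vocabulary, logits, and the `two_pass` helper below are assumptions made for illustration; in particular, `two_pass` is only a rough guess at the spirit of decoupling tool execution from constrained response generation, not the paper's implementation.

```python
import math

# Toy vocabulary (assumed): five JSON tokens plus one tool-call token.
VOCAB = ["{", "}", '"answer"', ":", '"42"', "<tool_call>"]


def masked_argmax(logits, allowed):
    """Greedy decoding step under a grammar mask: disallowed tokens get -inf."""
    masked = [l if tok in allowed else -math.inf for tok, l in zip(VOCAB, logits)]
    return VOCAB[max(range(len(VOCAB)), key=masked.__getitem__)]


# The model strongly prefers to call a tool at this step (toy logits).
logits = [1.0, 0.1, 0.2, 0.1, 0.3, 5.0]

# Hypothetical mask compiled from a JSON Schema: only JSON tokens are legal.
json_only = {"{", "}", '"answer"', ":", '"42"'}

unconstrained = masked_argmax(logits, set(VOCAB))  # tool call is reachable
constrained = masked_argmax(logits, json_only)     # tool call is masked out


def two_pass(logits_pass1, logits_pass2):
    """Pass 1 decodes freely so a tool call can fire; pass 2 applies the mask."""
    first = masked_argmax(logits_pass1, set(VOCAB))
    tool_used = first == "<tool_call>"
    final = masked_argmax(logits_pass2, json_only)
    return tool_used, final


print(unconstrained, constrained, two_pass(logits, logits))
```

Even though the tool-call token has by far the highest logit, the constrained step can never emit it, which matches the summary's point that each capability works in isolation but the combination fails.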

arxiv arXiv cs.CL · 11h ago

REVERIEMEM: Perspective-Bounded Memory for Book-Based Role-Playing Agents

Recent large language model role-playing systems often fail in long-narrative contexts due to factual overreach and stylistic monotony. Factual overreach occurs when characters access information outside their narrative perspective, while stylistic monotony flattens character voices through static profile descriptions. To address these issues, the authors propose REVERIEMEM, a three-layer memory architecture designed for book-based character agents. This system utilizes an episodic layer for first-person scene memories, a semantic layer for visibility-tagged facts, and a personality layer for situation-dependent behavioral patterns. The researchers also introduce KBF-QA, a benchmark consisting of 4,386 questions across eight novels to test knowledge boundaries. Experimental results show that REVERIEMEM improves Knowledge Boundary Fidelity by 34.6 percentage points compared to prior methods. Additionally, the model achieves approximately a 79% win rate on BOOKWORLD's five-dimension pairwise narrative protocol. These findings suggest that perspective-bounded memory effectively enhances both factual accuracy and character-grounded narrative generation.

arxiv arXiv cs.CL · 11h ago

MedGuards: Multi-Agent System for Reliable Medical Error Detection and Correction

The authors propose MedGuards, a medical safety guardrail framework designed to detect and correct errors in text generated by Large Language Models. This system treats error handling as a multi-agent in-context learning task where specialized agents separately perform detection, localization, and correction. A confidence-guided arbitration mechanism resolves disagreements among agents using reasoning traces and confidence scores without requiring additional model training. The study introduces the Keyword-Prioritized Correction Score (KPCS), a new metric that evaluates the accuracy of critical keywords within reference text. Experiments conducted across four multilingual medical datasets of clinical notes demonstrate significant improvements in performance metrics. These results highlight enhanced interpretability, robustness, and adaptability for safer LLM deployment in healthcare. The code for the MedErrBench benchmark is publicly available on GitHub.

github llama.cpp · 11h ago

llama.cpp b9786 Release Adds OpenCL Non-Contiguous Row Support

The llama.cpp project has released version b9786, introducing support for non-contiguous rows in normalization via OpenCL. This update is part of the ongoing development by the ggml-org team to enhance hardware compatibility and performance across various platforms. The release provides binaries for macOS Apple Silicon, Intel Macs, and iOS XCFrameworks. Linux users can access builds for Ubuntu x64, arm64, and s390x architectures using CPU, Vulkan, ROCm 7.2, OpenVINO, and SYCL backends. Android support is available for arm64 CPU devices, while Windows offers extensive options including CPU, CUDA 12 and 13, Vulkan, OpenVINO, SYCL, and HIP. The release also lists disabled builds for KleidiAI on macOS and openEuler platforms.

arxiv arXiv cs.CL · 11h ago

Framework Evaluates When GraphRAG and Agentic RAG Are Needed

The authors introduce a framework for evaluating and comparing regular, GraphRAG, Modular, and Agentic Retrieval-Augmented Generation (RAG) on semi-structured knowledge bases. They implement nine standardized scenarios spanning simple document retrieval to complex hybrid text-graph integration and agentic multi-step planning. A novel context engineering method is presented to address memory overflow issues in advanced RAG variants through new representations and agentic loop design. This optimization achieves a 19% to 53% reduction in token usage while efficiently managing retrievals. Further analysis reveals a retrieval-generation gap where expanded retrieval does not proportionally improve generation quality. The study suggests that current retrieval-oriented metrics may overstate the benefits of advanced retrieval techniques. These data-driven insights aim to guide the development of production-ready intelligent RAG systems.

arxiv arXiv cs.CL · 11h ago

BITEMBED: Extreme Low-Bit Framework for LLM-Based Text Embeddings

The paper introduces BITEMBED, an extreme low-bit framework designed to address the high deployment costs of LLM-based text embedders by targeting both encoding efficiency and vector storage. The method converts pretrained LLM backbones into BitNet-style encoders featuring ternary weights, quantized activations, and lightweight normalization refinement. To adapt these models for representation learning, BITEMBED employs continual contrastive pre-training followed by supervised contrastive fine-tuning. This fine-tuning process utilizes similarity-distribution distillation and attention-relation distillation from a full-precision teacher model. Beyond backbone quantization, the framework trains output embeddings to support multiple storage precisions, allowing for flexible trade-offs between performance and storage costs. Experiments on the MMTEB benchmark using Qwen3-0.6B and Gemma3-270M demonstrate that BITEMBED performs largely comparably to full-precision teacher embedders.
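For readers unfamiliar with ternary weights, the sketch below shows absmean ternarization in the style of BitNet b1.58. Whether BITEMBED uses exactly this quantizer is an assumption; the function names and the sample weights are illustrative only.

```python
def ternarize(weights, eps=1e-8):
    """Quantize a flat list of weights to {-1, 0, +1} with a single scale.

    Absmean scheme: scale = mean(|w|), q = clip(round(w / scale), -1, 1).
    """
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    q = [max(-1, min(1, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map ternary codes back to approximate real-valued weights."""
    return [v * scale for v in q]


# Toy weights (assumed).
q, scale = ternarize([0.9, -0.05, 0.4, -1.2, 0.0])
print(q, round(scale, 3))
print(dequantize(q, scale))
```

Each weight is stored as one of three values plus a shared scale, which is where the storage and compute savings come from; the distillation steps described above are what compensate for the lost precision.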

arxiv arXiv cs.CL · 11h ago

TRACE: Lightweight Detection of Corpus Poisoning in RAG via Token Influence Attribution

Retrieval-Augmented Generation systems face significant risks from corpus poisoning attacks that manipulate outputs through malicious documents. Existing detection methods often require auxiliary classifiers or additional LLM verification, which introduces substantial computational overhead. To address this, researchers introduced TRACE, a lightweight framework that identifies poisoning by tracing answer-related tokens via influence attribution. The system first discovers recurrent high-influence keywords across retrieved documents to flag potential threats. It then performs secondary verification to confirm the specific influence of these tokens on model predictions. Experiments conducted on three QA benchmarks and six LLMs demonstrate strong detection performance for the framework. Additionally, TRACE successfully uncovers attacker-specified target answers during the verification process.

arxiv arXiv cs.CL · 11h ago

RAS: Measuring LLM Safety Through Refusal Alignment

The authors propose SafeVec, a white-box evaluation procedure that measures LLM safety using internal representations instead of generated outputs. This method extracts layer-wise refusal directions from a safety-aligned reference model to identify stable layers where safe and unsafe behaviors are separable. It then scores target models by checking if their hidden states align with these refusal directions during unsafe prompts. The resulting metric, RAS (Refusal Alignment Score), maps this alignment to a calibrated 0-100 safety score. Experiments across Llama, Gemma, and Qwen families show RAS effectively separates aligned models from uncensored variants. Additionally, the metric tracks output-level attack success rates while being substantially faster than judge-based evaluations. These findings suggest refusal alignment offers a compact and efficient signal for white-box safety assessment.
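A minimal sketch of the scoring idea follows. It assumes the refusal direction is a difference of mean hidden states (a common choice, not confirmed by the summary) and maps average cosine alignment linearly from [-1, 1] to [0, 100]; the paper's calibration may differ, and all function names here are hypothetical.

```python
import math


def mean_vec(vectors):
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def refusal_direction(unsafe_states, safe_states):
    """Difference of means: unsafe-prompt mean minus safe-prompt mean."""
    mu_u, mu_s = mean_vec(unsafe_states), mean_vec(safe_states)
    return [u - s for u, s in zip(mu_u, mu_s)]


def alignment_score(target_states, direction):
    """Average cosine with the refusal direction, mapped linearly to 0-100."""
    avg = sum(cosine(h, direction) for h in target_states) / len(target_states)
    return 50.0 * (avg + 1.0)


# Toy 2-D hidden states (assumed).
direction = refusal_direction([[1.0, 0.0], [3.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]])
print(alignment_score([[1.0, 0.0]], direction))   # fully aligned
print(alignment_score([[-1.0, 0.0]], direction))  # anti-aligned
```

Because the score needs only a forward pass to read hidden states and no generation or judge call, it is plausible that such a metric runs much faster than judge-based evaluation, as the summary reports.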

arxiv arXiv cs.CL · 12h ago

OPERA: Aligning Open-Ended Reasoning via Objective Perplexity-based Reinforcement Learning

The OPERA framework addresses the instability of applying reinforcement learning to open-ended tasks by replacing external judge models with intrinsic rewards derived from perplexity dynamics. This approach quantifies uncertainty reduction at critical reflective states, eliminating stylistic biases and positional inconsistencies common in LLM-as-a-judge systems. During the cold-start phase, the method utilizes guiding words to synthesize diverse reasoning traces and employs perplexity-prioritized rollouts to identify logically consistent branches. This pipeline generates a large-scale dataset of 20,000 high-quality reasoning trajectories for training. Implementing OPERA on the Qwen3-8B model establishes a new state-of-the-art among open-source models. The system achieves parity with or surpasses proprietary models like Gemini2.5 and MiniMax-M2.5 in specific open-ended tasks. Empirical evaluations confirm the scalability and efficacy of this objective perplexity-based alignment strategy.
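As a rough illustration of a perplexity-based intrinsic reward, the sketch below computes perplexity from token log-probabilities and rewards the relative drop between two scorings of the same continuation. The reward definition and the name `uncertainty_reduction` are assumptions; the summary does not give OPERA's exact formula.

```python
import math


def perplexity(token_logprobs):
    """exp of the negative mean natural-log probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def uncertainty_reduction(logprobs_before, logprobs_after):
    """Hypothetical intrinsic reward: relative drop in perplexity of a
    continuation scored before vs. after a reflective step."""
    before = perplexity(logprobs_before)
    after = perplexity(logprobs_after)
    return (before - after) / before


# Toy log-probabilities (assumed): token probability rises from 0.25 to 0.5.
before = [math.log(0.25)] * 4
after = [math.log(0.5)] * 4
print(perplexity(before), perplexity(after), uncertainty_reduction(before, after))
```

A reward of this kind depends only on the policy's own token probabilities, so it avoids calling an external judge model.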

arxiv arXiv cs.CL · 12h ago

Do Encoders Suffice? A Systematic Comparison of Encoder and Decoder Safety Judges for LLM Adversarial Evaluation

This study evaluates whether fine-tuned ModernBERT encoder classifiers can serve as cost-effective alternatives to LLM-based judges for safety evaluation. The researchers benchmarked ModernBERT and Ettin against rule-based prefix matching, fine-tuned LLM classifiers, and various LLM judge methodologies. These LLM judges included strategies from StrongReject, ShieldGemma, JailbreakBench, AILuminate, SorryBench, Claude-as-a-judge, and models like LlamaGuard 3 and 4. The encoder classifiers were trained on judge-labeled data using a majority-voting label strategy and tested on a gold-standard holdout dataset. Performance was measured using F1 score, false negative rate, and precision-recall metrics across open-source adversarial datasets. Results were further analyzed by attack technique, including single-turn prompting, decomposition, escalation, and context manipulation. The findings provide guidance on when encoder classifiers can reliably replace LLM-based judges without substantial performance loss.
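The majority-voting label strategy can be sketched in a few lines. The tie handling shown here (returning a caller-chosen `tie_label`, by default `None` so the example can be dropped) is an assumption, since the summary does not say how ties are resolved.

```python
from collections import Counter


def majority_label(votes, tie_label=None):
    """Return the most common judge label, or tie_label when the top two tie."""
    if not votes:
        return tie_label
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return tie_label
    return counts[0][0]


# Toy judge verdicts for one completion (assumed).
print(majority_label(["unsafe", "unsafe", "safe"]))
print(majority_label(["safe", "unsafe"]))
```

Labels produced this way are only as good as the judges that voted, which is why the study tests the trained encoders on a separate gold-standard holdout set.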

media Hugging Face Forums · 12h ago

Niodoo: A Local Runtime for Hidden State Steering of Frozen LLMs

Jason Van Pham has released Niodoo, a local runtime designed to steer frozen large language models through their hidden states. The project aims to correct last-step errors by injecting noise or "physics forces" during inference to break token loops. This approach allows smaller models to improve performance without fine-tuning, targeting specific failure cases like the Llama strawberry prompt benchmark. The system generates its own telemetry tags and uses TDA to monitor internal model states for looping behavior. Van Pham developed this tool through months of self-directed research and red-teaming, emphasizing reproducible results from pinned hashes. The code is available on GitHub under the repository Ruffian-L/niodoo-hidden-state-steering.