All articles — korshunov.ai


Introducing corpora Hlava Cor and Hlava AD: Human Label Variation in Coreference and Discourse Relations

Researchers have created two new corpora, Hlava Cor and Hlava AD, to explore human variation in understanding text coherence. These resources contain multiple annotations of Czech texts along with annotators' explanations for their choices. The first corpus, Hlava Cor, consists of 1,024 contexts annotated by three individuals to capture coreference identification differences. It covers pronouns, full noun phrases, and anaphoric adverbials across various text types and grammatical-semantic categories. The second corpus, Hlava AD, comprises 512 contexts annotated by five annotators focusing on discourse relations in attributive and non-attributive constructions. Both corpora achieve an inter-annotator agreement of approximately 60-65 percent. Analysis reveals that lower coreference agreement correlates with automatic model disagreement, indicating higher ambiguity. Annotator comments further highlight varying confidence levels and individual reading strategies.

arXiv cs.CL · 4h ago

Agent-Authored World Modeling Aligns Training with Decision Needs

The paper introduces Agent-Authored World Modeling (AAWM), a training procedure that addresses the limitations of standard world modeling objectives tied to next-observation prediction. This traditional approach often omits dynamics relevant to an agent's current decision because supervision depends on what a transition reveals rather than what is needed. AAWM constructs supervision directly from the policy's decision needs by having the agent identify necessary environmental understanding at each state. Relevant transition evidence is retrieved across trajectories and synthesized into training targets that capture these decision-oriented dynamics. This method aligns the learning objective with the specific information required before acting, rather than forcing the model to reconstruct the next observation. Experimental results validate AAWM's effectiveness across multiple environments and training settings. The findings demonstrate that decision-aware world-model targets provide a more effective learning signal than conventional next-observation prediction.

arXiv cs.CL · 4h ago

OscillaTTS: Adaptive Oscillatory Inductive Bias for Modeling Sharp Prosodic Dynamics in Diffusion-Based TTS

Diffusion-based text-to-speech models have improved speech quality but struggle with sharp prosodic transitions and rapid pitch variations. Existing decoders often use periodic nonlinearities like the Snake activation function, which lack adaptability for abrupt amplitude and frequency changes. To address this, the authors introduce OscillaTTS, a system featuring an adaptive oscillatory nonlinearity. This component enables controllable periodic modulation while ensuring signal stability via a linear bypass mechanism. The study investigates the role of oscillatory inductive bias within diffusion-based TTS decoders. Experiments on LJSpeech and the Emotional Speech Dataset demonstrate consistent improvements in both objective and subjective evaluations. These results indicate that OscillaTTS models expressive prosodic dynamics more effectively than prior methods.

arXiv cs.CL · 5h ago

Evaluating Japanese Dialect Robustness Across Speech and Text-based Large Language Models

This study investigates the dialectal robustness of large language models (LLMs) and speech language models (SLMs) using Japanese dialects as a test case. While LLM-based dialogue systems have advanced, dialectal variation remains a significant challenge, particularly for spoken input processing. The research defines robustness as the ratio of performance on dialectal versus standard inputs to enable fair comparisons across different model types. Experiments reveal that SLM robustness correlates directly with the robustness of their underlying text-based LLM counterparts. Additionally, the study finds that training with dialectal data and fine-tuning the speech encoder both serve to improve robustness in SLMs. These findings clarify how base LLM capabilities affect SLM performance and identify effective strategies for enhancing dialect comprehension.
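
The robustness ratio described above can be sketched in a few lines. This is a minimal illustration, not the authors' evaluation code: the function name `robustness`, the model names, and all scores are hypothetical and invented purely to show the arithmetic.

```python
def robustness(dialect_score: float, standard_score: float) -> float:
    """Ratio of performance on dialectal input to performance on standard input."""
    if standard_score <= 0:
        raise ValueError("standard_score must be positive")
    return dialect_score / standard_score


# Hypothetical task accuracies (not taken from the study).
scores = {
    "text_llm": {"standard": 0.80, "dialect": 0.60},
    "speech_lm": {"standard": 0.50, "dialect": 0.30},
}

# A ratio of 1.0 would mean no degradation on dialectal input.
ratios = {name: robustness(s["dialect"], s["standard"]) for name, s in scores.items()}

for name, ratio in ratios.items():
    print(f"{name}: robustness = {ratio:.2f}")
```

Because the ratio normalizes by each model's own standard-input score, a weaker speech model and a stronger text model can be compared on how much they degrade rather than on absolute accuracy.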

GitHub llama.cpp · 5h ago

Fix failed unit test cases for conv_3d in SYCL

A pull request to the ggml-org/llama.cpp project on GitHub fixes failing unit test cases for the conv_3d operation in the SYCL backend. The changes resolve the errors that previously prevented these tests from running successfully. The fix should improve stability for users relying on SYCL-based hardware acceleration.

arXiv cs.CL · 5h ago

PolicyAlign: Direct Policy-Based Safety Alignment for Large Language Models

The authors introduce PolicyAlign, a framework designed to align large language models directly with natural-language safety policies rather than relying on costly supervision data. This approach addresses the mismatch between rapidly evolving safety requirements and conventional data-driven alignment methods. The process begins by synthesizing instructions that violate the specified policy, followed by on-policy self-distillation to internalize the desired behavior. To enhance training stability and data efficiency, the method incorporates Policy-Sensitive Filtering, which selects instructions inducing the largest behavioral shift. Experiments across multiple models demonstrate that PolicyAlign consistently improves safety metrics while maintaining low over-refusal rates and preserving general capabilities. The framework also generalizes effectively to specialized domains such as medical, legal, and financial safety scenarios. The code for this scalable alignment approach is released at https://github.com/Qwen-Applications/PolicyAlign.

arXiv cs.CL · 5h ago

Translation-Enhanced Speech Encoder Pre-training Improves Speech LLMs

Connecting a pre-trained speech encoder to a Large Language Model creates a structural misalignment because encoders often produce language-specific representations while LLMs operate in a unified, language-agnostic space. The authors argue that incorporating speech translation objectives into the pre-training process provides a principled mechanism to bridge this gap. Unlike monolingual transcription, translation forces the model to learn representations that are independent of specific languages. The study experimentally evaluates the impact of adding these translation objectives during speech encoder pre-training. Results demonstrate that this approach significantly improves cross-modal integration between the speech and text modalities. Consequently, models utilizing translation-enhanced pre-training achieve superior performance across various downstream Speech LLM tasks.

arXiv cs.CL · 5h ago

Harness Design and Post-Training in LLM Agents

The article examines how tool harness design impacts the post-training of large language model agents. It argues that while agents are routinely post-trained, the scaffolding that determines tool exposure is often treated as a fixed detail. Existing algorithms typically assume static environments, ignoring shifts in tools and tasks during deployment. To address this gap, the authors extend ALFWorld to treat harness design as a controllable dimension. This extension supports evaluation under both task and tool environment shifts. The study systematically analyzes harness influence on post-training in in-distribution and out-of-distribution settings. Results show that harness-aware post-training improves performance and enables robust adaptation to new environments. Conversely, minimal design effort leads to drastic performance drops under strong environmental shifts.

arXiv cs.CL · 5h ago

Reclaim Evaluation Shows Lossy Memory Is Worse Than No Memory

A study demonstrates that a language model's memory containing incorrect conclusions is more detrimental than having no memory at all. When models retain stale values while dropping supporting work, they emit confident but wrong answers, whereas empty memories allow for abstention. This phenomenon, termed brittle memory, was observed across seven models where the direction of failure never reversed regardless of task or disposition. The researchers introduced reclaim evaluation to measure correctability by compressing interactions and testing if corrections recover ground truth without using a judge. Results indicate that correctability depends on whether the source information survives compression rather than model capability. A source-first policy, which keeps recomputable sources and drops re-derivable conclusions, restored correctability significantly better than length-matched controls. In chained memory loops, dropped-source errors corrupt downstream steps irreparably, while the proposed fix maintains bounded performance horizons. The findings replicate across three deployed systems and real dialogue data, with a hand-built oracle reaching perfect accuracy.

arXiv cs.CL · 5h ago

The Generalization Spectrum: A Chromatographic Approach to Evaluating Learning Algorithms

Traditional evaluations reduce learning to a single aggregate score, obscuring how well knowledge from one example generalizes to others. The authors introduce the Generalization Spectrum, an evaluation framework that measures per-sample generalization by tracking performance across test variants with increasing transfer distance. These variants range from exact recall to implementation transfer across languages and context transfer under narrative reframing. The framework is instantiated on competitive programming using a selection-and-synthesis pipeline seeded with recent problems to mitigate contamination. Comparisons of canonical learning paradigms show that Reinforcement Learning converts memorization into near-transfer more efficiently than Supervised Fine-Tuning baselines. In-context learning exhibits strong but correspondence-dependent transfer capabilities in this context. Diagnostic profiles reveal that local gains do not necessarily expand the generalization radius for all methods. Specifically, abstractions and hints mainly lift local transfer, while Reference SFT preserves a stronger far-transfer tail than RFT. Furthermore, self-distillation or hint-assisted RL can reduce far transfer even when local transfer improves.

arXiv cs.CL · 5h ago

Probing Self-Supervised Speech Representations on Mandarin Sub-dialects via Unsupervised Articulatory Analysis

This study investigates how internal phonetic representations in self-supervised speech models behave under fine-grained dialect variation, addressing the limitations of existing probing studies that rely on curated corpora. The authors present a case study using an entirely unlabeled probing pipeline for Mandarin sub-dialects. Phone sequences are generated via a language-agnostic universal phone recognizer and mapped to articulatory feature vectors, enabling frame-level probing without manual annotation. Results reveal structured patterns in articulatory feature decodability across different Mandarin dialects. Acoustically salient features like labiality and stridency remain comparatively stable, while those associated with finer spectral distinctions show larger dialect-dependent variation. This variation is driven primarily by elevated decodability for Beijing speech relative to other sub-dialects. Layer-wise analyses demonstrate distinct representational dynamics for these feature groups, suggesting uneven dialect sensitivity across articulatory dimensions.

arXiv cs.CL · 5h ago

Fully Differentiable Neural Forced Alignment via Soft Dynamic Programming

The authors propose an end-to-end, fully differentiable neural architecture designed specifically for phoneme alignment to address the stagnation in this field compared to ASR advancements. The model features an encoder with two complementary branches dedicated to phoneme identity verification and boundary detection. A decoder implemented as a trainable module based on differentiable soft dynamic programming produces the final alignment decisions. The entire system is optimized using a novel contrastive loss that encourages clear separation between steady-state phoneme regions and transition boundaries. Experimental results show the approach outperforms current state-of-the-art methods on hand-annotated English benchmarks. Additionally, the model demonstrates strong word-level generalization capabilities and effective performance on unseen languages.

arXiv cs.CL · 5h ago

Fine-Tuned PEGASUS Achieves State-of-the-Art Performance on XL-Sum English Corpus

This paper presents a method for optimizing abstractive text summarization by fine-tuning the PEGASUS model on the XL-Sum English corpus. The objective is to surpass the performance of the baseline mT5 model in generating concise summaries that capture salient ideas without merely extracting sentences. The generated summaries are evaluated using the ROUGE metric, which compares auto-generated outputs against human-created references. The study claims that the fine-tuned PEGASUS model achieves state-of-the-art results on this specific dataset. Quantitative analysis reveals a 4.04% improvement in the ROUGE-1 score compared to the baseline. Additionally, the model demonstrates a significant 15.25% increase in the ROUGE-2 score. Finally, there is a reported 3.39% improvement in the ROUGE-L score, confirming the effectiveness of the fine-tuning approach.

arXiv cs.CL · 6h ago

Red Teaming Framework Uncovers LLM Faithfulness Vulnerabilities via Multi-Role Architecture

This paper introduces a red teaming framework designed to systematically uncover vulnerabilities in large language model outputs through a multi-role architecture. The system utilizes target, attacker, and jury models to generate adversarial prompts and rigorously evaluate response accuracy and consistency. In a case study on faithfulness evaluation, exploitative adversarial prompts increased the attack success rate by up to 7.9% in question-answering tasks. The research demonstrates that architectural design choices typically outweigh parameter scaling in determining model safety and identifies how structural constraints shape vulnerability patterns. The framework shows adaptability across diverse evaluation tasks, ranging from English question-answering to Arabic summarization. However, the approach faces challenges in fully automating adversarial prompt generation across different languages. Additionally, experiments reveal limitations in detecting subtle forms of unfaithfulness that do not manifest as explicit factual contradictions.

arXiv cs.CL · 6h ago

Calibration and Adversarial Robustness of Automated ASR Scoring

This study evaluates the reliability of automated judges used to measure attack success rate (ASR) in LLM jailbreaks by comparing them against human majority votes. Using 596 human-labeled completions from HarmBench, the authors find that dedicated safety classifiers over-flag with high recall but lower precision, while LLM-as-judges exhibit erratic recall ranging from 0.06 to 0.65. These discrepancies cause significant variability in reported ASR depending on which judge family is employed. The research also highlights sharp differences in robustness, showing that benign framing wrappers can flip LLM-judge decisions between 57% and 100% of the time. In contrast, dedicated classifiers resist such surface attacks but remain vulnerable to white-box GCG attacks, which flipped 70% of confident true positives despite a small optimization budget. A two-annotator audit confirmed that these adversarial flips preserved the underlying harmful content. Consequently, many current ASR metrics are deemed unreliable both under average conditions and under deliberate adversarial pressure. The authors recommend reporting judge precision and recall on human-labeled data and including adversarial checks in future research.
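
To make the recommendation concrete, the sketch below scores an automated judge against a human majority vote and shows how the reported attack success rate (ASR) shifts with the judge. It is an illustration under stated assumptions, not the authors' code: the helper names `majority_vote` and `judge_report`, the three-annotator setup, and all labels are hypothetical.

```python
def majority_vote(labels):
    """True if strictly more than half of the human labels mark the completion as harmful."""
    return sum(labels) * 2 > len(labels)


def judge_report(human_votes, judge_flags):
    """Precision and recall of a judge against the human majority, plus both ASR estimates."""
    gold = [majority_vote(v) for v in human_votes]
    tp = sum(g and j for g, j in zip(gold, judge_flags))
    fp = sum((not g) and j for g, j in zip(gold, judge_flags))
    fn = sum(g and (not j) for g, j in zip(gold, judge_flags))
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "judge_asr": sum(judge_flags) / len(judge_flags),  # ASR as the judge would report it
        "human_asr": sum(gold) / len(gold),                # ASR under the human majority vote
    }


# Hypothetical data: four completions, three human annotators each (True = harmful).
human_votes = [
    [True, True, False],
    [True, False, False],
    [True, True, True],
    [False, False, False],
]
over_flagging = judge_report(human_votes, [True, True, True, False])
under_flagging = judge_report(human_votes, [True, False, False, False])

print(over_flagging)
print(under_flagging)
```

In this toy data the over-flagging judge has full recall but inflates ASR above the human estimate, while the under-flagging judge has full precision but deflates it, which is why reporting both metrics alongside ASR is informative.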

arXiv cs.CL · 6h ago

STC Improves Arabic Customer Service via MARBERT Sentiment Analysis

Saudi Telecom Company (STC) aims to enhance user satisfaction by leveraging Twitter feedback for sentiment analysis. The study addresses the gap in Arabic Natural Language Processing by training the MARBERT model on a specific dataset of 24,513 tweets. This collection includes 1,437 positive, 13,828 negative, and 5,694 neutral tweets, alongside 1,221 sarcastic and 2,297 indeterminate entries. The primary objective is to analyze these sentiments to improve STC's customer service responsiveness. Performance was evaluated using F1-score, precision, and recall to ensure robust detection of spam and sentiment. Results indicate that the proposed scheme offers promising accuracy compared to existing techniques in the literature.

arXiv cs.CL · 6h ago

Behavioral Drivers of Rating-Sentiment Incongruence in Sri Lankan Tourism Reviews

This study investigates the incongruence between star ratings and written review sentiments within Sri Lankan tourism attraction reviews. Analyzing a dataset of 16,156 reviews from 2010 to 2023, researchers employed a transformer-based pipeline to derive textual sentiment independently of assigned ratings. The analysis reveals that 18.6% of reviews exhibit incongruence, primarily driven by Conservative Rater and Obligatory 5-Star behaviors. These mismatches vary across venue types, with museums demonstrating the highest rates of divergence. Statistical tests, logistic regression, Random Forest, and SHAP analysis identify venue type, reviewer expertise, review length, and temporal factors as key contributors to this phenomenon. The findings demonstrate that star ratings are not interchangeable with textual sentiment and require validation before being used as ground-truth labels in NLP tasks.
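
A rating-sentiment incongruence rate of the kind reported above can be sketched as follows. The star-to-polarity cut-offs, the function names, and the sample reviews are assumptions made for illustration; the summary does not state how the study maps ratings to polarity, and in practice the `text_sentiment` label would come from a sentiment model rather than being hand-written.

```python
def rating_polarity(stars: int) -> str:
    """Map a 1-5 star rating to a coarse polarity (assumed cut-offs, not from the study)."""
    if stars <= 2:
        return "negative"
    if stars == 3:
        return "neutral"
    return "positive"


def incongruence_rate(reviews) -> float:
    """Share of reviews whose star polarity differs from the text sentiment label."""
    mismatches = [r for r in reviews if rating_polarity(r["stars"]) != r["text_sentiment"]]
    return len(mismatches) / len(reviews)


# Hypothetical reviews with text sentiment derived independently of the rating.
reviews = [
    {"stars": 5, "text_sentiment": "positive"},
    {"stars": 5, "text_sentiment": "negative"},  # high rating, negative text
    {"stars": 3, "text_sentiment": "positive"},  # middling rating, positive text
    {"stars": 1, "text_sentiment": "negative"},
]
rate = incongruence_rate(reviews)
print(f"incongruence rate: {rate:.1%}")
```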

arXiv cs.CL · 6h ago

Cliff Tokens: Identifying Single-Token Failure Triggers in LLM Mathematical Reasoning

Researchers introduce the concept of cliff tokens to identify specific single-token failure triggers in large language models during mathematical reasoning tasks. Unlike prior work that analyzes failures at step or sentence levels, this method pinpoints the exact token where the potential drops significantly, using an adaptive threshold based on a z-test. The study evaluates seven models across three benchmarks: GSM1K, MATH500, and AIME 2025. Deleting the first cliff token and resampling allows recovery of pass@64 to 1.0, whereas keeping it limits recovery to between 0.71 and 1.00. The authors propose a taxonomy classifying cliffs as deterministic, uncertain, or sampled-off based on greedy choice and token entropy. This classification generalizes across different model scales and exhibits distinct probabilistic characteristics for each type. Furthermore, the team validates this taxonomy through single-token preference optimization known as Cliff-DPO. Trained on GSM8K, Cliff-DPO improves accuracy by up to +6.6 across benchmarks. Optimization proves effective for uncertain and sampled-off cliffs but yields no improvement for deterministic ones.

arXiv cs.CL · 6h ago

SWE-Pro Benchmark Reveals Significant Gap Between LLMs and Expert Software Optimization

The SWE-Pro benchmark addresses the lack of realistic evaluation frameworks for software performance optimization by introducing a repository-level dataset derived from 102 expert-written optimizations. Unlike previous benchmarks that oversimplify tasks, SWE-Pro pairs each task with parameterized tests to evaluate runtime, peak memory, and Time-Weighted Memory Usage under noise-aware conditions. The study reveals that current Large Language Models struggle significantly with these complex requirements, showing negligible runtime gains and nearly non-existent memory optimizations. In sharp contrast, expert implementations achieved an aggregate speedup of 15.5x and a peak memory reduction of 171.3x across the benchmark tasks. Expert-written improvements were observed in 91.2% of tasks for runtime and 65.7% for peak memory. These findings expose a substantial gap between current LLM capabilities and the demands of expert-level engineering.

arXiv cs.CL · 6h ago

Security and Privacy in Retrieval-Augmented Generation: Architectures, Threats, Defenses, and Future Directions

This survey examines the security and privacy challenges inherent in Retrieval-Augmented Generation (RAG) systems across centralized, on-device, federated, and hybrid paradigms. It presents a unified taxonomy of threat surfaces that span retrieval, context construction, and generation stages. The analysis covers specific attack classes including membership inference, index inference, poisoning, gradient leakage, and collusion. Sensitive information risks are identified within retrieval indices, query logs, context construction, and federated updates. Adversarial manipulation of knowledge bases is highlighted as a key factor undermining trust in generated outputs. The paper reviews architectural, algorithmic, and cryptographic defenses while addressing privacy-utility trade-offs. Finally, it outlines open research challenges for building trustworthy and resilient RAG systems.