Evaluation & benchmarks — korshunov.ai

Evaluation & benchmarks Page 1 / 44

Survey of Toxicity Detection and Mitigation Strategies for Multilingual Language Models

This survey synthesizes research on toxicity detection and detoxification strategies specifically designed for multilingual large language models. It catalogs threat models that exploit linguistic variations such as code-switching, orthographic differences, and translation pivots to bypass safety alignments. The authors organize existing work into task formulations like toxic-to-neutral rewriting and classification, alongside various detection approaches including cross-lingual encoders and LLM-based detectors. Mitigation strategies are detailed across data filtering, supervised tuning, decoding-time steering, and the implementation of multilingual guardrails. The analysis highlights persistent challenges in the field, notably uneven language coverage and fragmented evaluation protocols. Furthermore, it addresses the complexity of culturally contingent definitions of harm and the risk that detoxification efforts may suppress legitimate dialectal or identity-related expression.

arxiv arXiv cs.CL · 12h ago

Evaluating Japanese Dialect Robustness Across Speech and Text-based Large Language Models

This study investigates the dialectal robustness of large language models (LLMs) and speech language models (SLMs) using Japanese dialects as a test case. While LLM-based dialogue systems have advanced, dialectal variation remains a significant challenge, particularly for spoken input processing. The research defines robustness as the ratio of performance on dialectal versus standard inputs to enable fair comparisons across different model types. Experiments reveal that SLM robustness correlates directly with the robustness of their underlying text-based LLM counterparts. Additionally, the study finds that training with dialectal data and fine-tuning the speech encoder both serve to improve robustness in SLMs. These findings clarify how base LLM capabilities affect SLM performance and identify effective strategies for enhancing dialect comprehension.

arxiv arXiv cs.CL · 13h ago

Reclaim Evaluation Shows Lossy Memory Is Worse Than No Memory

A study demonstrates that a language model's memory containing incorrect conclusions is more detrimental than having no memory at all. When models retain stale values while dropping supporting work, they emit confident but wrong answers, whereas empty memories allow for abstention. This phenomenon, termed brittle memory, was observed across seven models where the direction of failure never reversed regardless of task or disposition. The researchers introduced reclaim evaluation to measure correctability by compressing interactions and testing if corrections recover ground truth without using a judge. Results indicate that correctability depends on whether the source information survives compression rather than model capability. A source-first policy, which keeps recomputable sources and drops re-derivable conclusions, restored correctability significantly better than length-matched controls. In chained memory loops, dropped-source errors corrupt downstream steps irreparably, while the proposed fix maintains bounded performance horizons. The findings replicate across three deployed systems and real dialogue data, with a hand-built oracle reaching perfect accuracy.

arxiv arXiv cs.CL · 13h ago

The Generalization Spectrum: A Chromatographic Approach to Evaluating Learning Algorithms

Traditional evaluations reduce learning to a single aggregate score, obscuring how well knowledge from one example generalizes to others. The authors introduce the Generalization Spectrum, an evaluation framework that measures per-sample generalization by tracking performance across test variants with increasing transfer distance. These variants range from exact recall to implementation transfer across languages and context transfer under narrative reframing. The framework is instantiated on competitive programming using a selection-and-synthesis pipeline seeded with recent problems to mitigate contamination. Comparisons of canonical learning paradigms show that Reinforcement Learning converts memorization into near-transfer more efficiently than Supervised Fine-Tuning baselines. In-context learning exhibits strong but correspondence-dependent transfer capabilities in this context. Diagnostic profiles reveal that local gains do not necessarily expand the generalization radius for all methods. Specifically, abstractions and hints mainly lift local transfer, while Reference SFT preserves a stronger far-transfer tail than RFT. Furthermore, self-distillation or hint-assisted RL can reduce far transfer even when local transfer improves.

arxiv arXiv cs.CL · 13h ago

Fine-Tuned PEGASUS Achieves State-of-the-Art Performance on XL-Sum English Corpus

This paper presents a method for optimizing abstractive text summarization by fine-tuning the PEGASUS model on the XL-Sum English corpus. The objective is to surpass the performance of the baseline mT5 model in generating concise summaries that capture salient ideas without merely extracting sentences. The generated summaries are evaluated using the ROUGE metric, which compares auto-generated outputs against human-created references. The study claims that the fine-tuned PEGASUS model achieves state-of-the-art results on this specific dataset. Quantitative analysis reveals a 4.04% improvement in the ROUGE-1 score compared to the baseline. Additionally, the model demonstrates a significant 15.25% increase in the ROUGE-2 score. Finally, there is a reported 3.39% improvement in the ROUGE-L score, confirming the effectiveness of the fine-tuning approach.

arxiv arXiv cs.CL · 13h ago

Red Teaming Framework Uncovers LLM Faithfulness Vulnerabilities via Multi-Role Architecture

This paper introduces a red teaming framework designed to systematically uncover vulnerabilities in large language model outputs through a multi-role architecture. The system utilizes target, attacker, and jury models to generate adversarial prompts and rigorously evaluate response accuracy and consistency. In a case study on faithfulness evaluation, exploitative adversarial prompts increased the attack success rate by up to 7.9% in question-answering tasks. The research demonstrates that architectural design choices typically outweigh parameter scaling in determining model safety and identifies how structural constraints shape vulnerability patterns. The framework shows adaptability across diverse evaluation tasks, ranging from English question-answering to Arabic summarization. However, the approach faces challenges in fully automating adversarial prompt generation across different languages. Additionally, experiments reveal limitations in detecting subtle forms of unfaithfulness that do not manifest as explicit factual contradictions.

arxiv arXiv cs.CL · 13h ago

Calibration and Adversarial Robustness of Automated ASR Scoring

This study evaluates the reliability of automated judges used to measure attack success rates in LLM jailbreaks by comparing them against human majority votes. Using 596 human-labeled completions from HarmBench, the authors find that dedicated safety classifiers over-flag with high recall but lower precision, while LLM-as-judges exhibit erratic recall ranging from 0.06 to 0.65. These discrepancies cause significant variability in reported ASR depending on which judge family is employed. The research also highlights sharp differences in robustness, showing that benign framing wrappers can flip LLM-judge decisions between 57% and 100% of the time. In contrast, dedicated classifiers resist such surface attacks but remain vulnerable to white-box GCG attacks, which flipped 70% of confident true positives despite a small optimization budget. A two-annotator audit confirmed that these adversarial flips preserved the underlying harmful content. Consequently, many current ASR metrics are deemed unreliable under deliberate pressure or average conditions. The authors recommend reporting judge precision and recall on human-labeled data and including adversarial checks in future research.

arxiv arXiv cs.CL · 14h ago

STC Improves Arabic Customer Service via MARBERT Sentiment Analysis

Saudi Telecom Company (STC) aims to enhance user satisfaction by leveraging Twitter feedback for sentiment analysis. The study addresses the gap in Arabic Natural Language Processing by training the MARBERT model on a specific dataset of 24,513 tweets. This collection includes 1,437 positive, 13,828 negative, and 5,694 neutral tweets, alongside 1,221 sarcastic and 2,297 indeterminate entries. The primary objective is to analyze these sentiments to improve STC's customer service responsiveness. Performance was evaluated using f1-score, precision, and recall metrics to ensure robust detection of spam and sentiment. Results indicate that the proposed scheme offers promising accuracy compared to existing techniques in the literature.

arxiv arXiv cs.CL · 14h ago

Behavioral Drivers of Rating-Sentiment Incongruence in Sri Lankan Tourism Reviews

This study investigates the incongruence between star ratings and written review sentiments within Sri Lankan tourism attraction reviews. Analyzing a dataset of 16,156 reviews from 2010 to 2023, researchers employed a transformer-based pipeline to derive textual sentiment independently of assigned ratings. The analysis reveals that 18.6% of reviews exhibit incongruence, primarily driven by Conservative Rater and Obligatory 5-Star behaviors. These mismatches vary across venue types, with museums demonstrating the highest rates of divergence. Statistical tests, logistic regression, Random Forest, and SHAP analysis identify venue type, reviewer expertise, review length, and temporal factors as key contributors to this phenomenon. The findings demonstrate that star ratings are not interchangeable with textual sentiment and require validation before being used as ground-truth labels in NLP tasks.

arxiv arXiv cs.CL · 14h ago

SWE-Pro Benchmark Reveals Significant Gap Between LLMs and Expert Software Optimization

The SWE-Pro benchmark addresses the lack of realistic evaluation frameworks for software performance optimization by introducing a repository-level dataset derived from 102 expert-written optimizations. Unlike previous benchmarks that oversimplify tasks, SWE-Pro pairs each task with parameterized tests to evaluate runtime, peak memory, and Time-Weighted Memory Usage under noise-aware conditions. The study reveals that current Large Language Models struggle significantly with these complex requirements, showing negligible runtime gains and nearly non-existent memory optimizations. In sharp contrast, expert implementations achieved an aggregate speedup of 15.5x and a peak memory reduction of 171.3x across the benchmark tasks. Expert-written improvements were observed in 91.2% of tasks for runtime and 65.7% for peak memory. These findings expose a substantial gap between current LLM capabilities and the demands of expert-level engineering.

arxiv arXiv cs.CL · 14h ago

SFL-MTSC: Leveraging Semantic Frame-Level Multi-Task Self-Consistency for Robust Multi-Intent Spoken Language Understanding

Prompt-based spoken language understanding with large language models often suffers from inconsistent intent-slot structures due to decoding stochasticity, particularly in multi-intent scenarios. To address this, researchers propose Semantic Frame-Level Multi-Task Self-Consistency (SFL-MTSC), a novel structured aggregation framework operating at the semantic frame level. Instead of relying on output-level majority voting, SFL-MTSC decomposes predictions into intent-specific frames and applies domain-intent grouping alongside slot-level clustering. The framework evaluates cluster reliability using path support scoring to determine which frames are trustworthy. Reliable frames are retained and re-integrated to form the final prediction, ensuring greater structural consistency. Zero-shot experiments on the MAC-SLU benchmark dataset demonstrate improved slot F1 scores and overall accuracy compared to single-path inference. Intent accuracy remains largely stable across most settings while achieving these gains in slot-level performance.

arxiv arXiv cs.CL · 15h ago

MedGuards: Multi-Agent System for Reliable Medical Error Detection and Correction

The authors propose MedGuards, a medical safety guardrail framework designed to detect and correct errors in text generated by Large Language Models. This system treats error handling as a multi-agent in-context learning task where specialized agents separately perform detection, localization, and correction. A confidence-guided arbitration mechanism resolves disagreements among agents using reasoning traces and confidence scores without requiring additional model training. The study introduces the Keyword-Prioritized Correction Score (KPCS), a new metric that evaluates the accuracy of critical keywords within reference text. Experiments conducted across four multilingual medical datasets of clinical notes demonstrate significant improvements in performance metrics. These results highlight enhanced interpretability, robustness, and adaptability for safer LLM deployment in healthcare. The code for the MedErrBench benchmark is publicly available on GitHub.

arxiv arXiv cs.CL · 15h ago

RAS: Measuring LLM Safety Through Refusal Alignment

The authors propose SafeVec, a white-box evaluation procedure that measures LLM safety using internal representations instead of generated outputs. This method extracts layer-wise refusal directions from a safety-aligned reference model to identify stable layers where safe and unsafe behaviors are separable. It then scores target models by checking if their hidden states align with these refusal directions during unsafe prompts. The resulting metric, RAS (Refusal Alignment Score), maps this alignment to a calibrated 0-100 safety score. Experiments across Llama, Gemma, and Qwen families show RAS effectively separates aligned models from uncensored variants. Additionally, the metric tracks output-level attack success rates while being substantially faster than judge-based evaluations. These findings suggest refusal alignment offers a compact and efficient signal for white-box safety assessment.

arxiv arXiv cs.CL · 15h ago

Do Encoders Suffice? A Systematic Comparison of Encoder and Decoder Safety Judges for LLM Adversarial Evaluation

This study evaluates whether fine-tuned ModernBERT encoder classifiers can serve as cost-effective alternatives to LLM-based judges for safety evaluation. The researchers benchmarked ModernBERT and Ettin against rule-based prefix matching, fine-tuned LLM classifiers, and various LLM judge methodologies. These LLM judges included strategies from StrongReject, ShieldGemma, JailbreakBench, AILuminate, SorryBench, Claude-as-a-judge, and models like LlamaGuard 3 and 4. The encoder classifiers were trained on judge-labeled data using a majority-voting label strategy and tested on a gold-standard holdout dataset. Performance was measured using F1 score, false negative rate, and precision-recall metrics across open-source adversarial datasets. Results were further analyzed by attack technique, including single-turn prompting, decomposition, escalation, and context manipulation. The findings provide guidance on when encoder classifiers can reliably replace LLM-based judges without substantial performance loss.

arxiv arXiv cs.CL · 16h ago

Argus Benchmark Evaluates Uncertainty Quantification Stability Across Vision-Language Models and GUI Grounding Datasets

The authors introduce Argus, a benchmark designed to evaluate post-hoc uncertainty quantification for computer-use agents that translate vision-language model predictions into executable GUI actions. The study assesses 28 open-weight methods across four VLM agents and four datasets, alongside eight closed-source methods from three vendors where internal model states are inaccessible. Key findings reveal selective transfer stability, where uncertainty rankings remain consistent across different datasets for a fixed model but degrade significantly when moving between different model classes or observable interfaces. Among open-weight options, hidden-state and density estimation techniques demonstrated the highest stability, while specific regimes favored sampling-based scores or verbalized self-assessment. Within-model ranking transfer proved strong with Spearman rho values up to 0.969, whereas cross-tier transfer to closed-source vendors averaged only +0.08. The research further indicates that conformal click regions shrink radii by 40-60 percent upon calibration but suffer coverage degradation under interface mismatch. To support regime-aware selection, the authors release per-item records, calibration splits, UQ scores, and analysis scripts.

arxiv arXiv cs.CL · 16h ago

How Large Language Models Source Brand Reputation Across Languages and Markets

This study analyzes the citation sources used by large language models when answering questions about brands, focusing on the underlying web references rather than just the generated text. The researchers merged three Rankfor.AI datasets to examine 167,551 URL-grounded citations across 128 brands in 12 home markets and 13 languages. The analysis reveals that AI grounds brand answers overwhelmingly in third-party sources, with 85.7% of citations pointing to sites the brand does not own compared to only 14.3% for owned domains. The source base is highly concentrated and follows a Zipf law, where 80% of citations originate from approximately 18% of domains. Wikipedia emerges as the dominant reference site, being the most-cited domain in 11 of the 12 languages studied. The only exception is Lithuanian, where the business daily vz.lt slightly edges out Wikipedia with a 4.38% share. Additionally, the source mix shows market-specific variations, such as YouTube being the top cited domain for Polish national brands and HR portals supplying more citations than Polish Wikipedia.

arxiv arXiv cs.CL · 16h ago

ToolBench-X: Benchmarking Tool-Using Agents Under Unreliable Environments

The authors introduce ToolBench-X, a new benchmark designed to evaluate large language model agents under recoverable tool-environment unreliability. Unlike existing benchmarks that assume clean and stable environments, this framework injects five structured hazard types: Specification Drift, Invocation Error, Execution Failure, Output Drift, and Cross-source Conflict. The dataset contains executable multi-step tasks across diverse domains with deterministic tools and canonical final answers for automatic evaluation. Crucially, every injected instance remains solvable through valid recovery paths such as retrying, fallback, or verification. Experiments reveal a substantial reliability gap where agents performing well with reliable tools often fail under these hazards. Further analysis indicates that failures stem from limited hazard diagnosis and ineffective recovery rather than tool-use volume or inference budget. Targeted recovery hints successfully recover many failed tasks, whereas test-time scaling yields more limited gains. These findings suggest that evaluation must shift focus from function-call accuracy to task completion in unreliable environments.

arxiv arXiv cs.LG · 17h ago

SAFER: Reliability-Guided Adaptive Ensembling for Robust Test-Time Adaptation

The authors address the brittleness of test-time adaptation (TTA) under adversarially contaminated streams by proposing SAFER, a training-free framework for robust TTA. SAFER acts as an augmentation wrapper that replaces single-view predictions with a reliability-guided pooled predictor to stabilize online updates. For each test sample, the method generates stochastic augmentations and aggregates their outputs using correlation-weighted pooling combined with outlier detection. An adaptive-mixing extension is also introduced, which adjusts the weighting between original and augmented inputs based on feature disagreement signals to preserve clean performance. The researchers evaluated SAFER on PACS, VLCS, and OfficeHome benchmarks under PGD attacks at various rates. Results indicate that SAFER improves the resilience of TTA methods against adversarial attacks while maintaining competitive accuracy on clean data.

arxiv arXiv cs.LG · 17h ago

ORBIT: Training-Free Multi-Attribute Behavioral Steering via Orthogonal Subspace Rotation

The authors introduce ORBIT, a training-free method for simultaneously controlling multiple behavioral attributes in large language models. Existing activation steering techniques struggle with multi-attribute control due to norm imbalance and directional cancellation when using naive vector summation. ORBIT addresses this by constructing a joint subspace from per-attribute steering planes via singular value decomposition. It then applies a single norm-preserving rotation within that subspace toward a combined target direction. The method incorporates adaptive per-token gating to identify necessary corrections at each position and an optional additive boost for weak projections. To evaluate the approach, the authors present TraitFactory, a benchmark focusing on behavioral tendencies rather than surface style. Experiments across Llama-3.2-3B, Qwen-2.5-7B, and Llama-3.1-8B models demonstrate that ORBIT achieves stronger and more balanced steering than baselines while preserving output coherence.

arxiv arXiv cs.LG · 17h ago

Reference-Free Assessment of Physical Consistency in World Model-based Video Generation

The authors introduce reference-free measures for evaluating the physical consistency of generated videos by combining relative and absolute fidelity assessments. This approach addresses the gap in physical fidelity that often prevents video generation tools like WorldGym or WorldEval from accurately reproducing real-world task success rates for VLA models. Unlike existing methods requiring costly human voting or unavailable ground-truth references, the new framework utilizes DROID-SLAM and SEA-RAFT to quantify inconsistencies. Motivated by WorldScore, the relative consistency assessment filters videos to improve task success rates by over 8%. Additionally, the absolute assessment enables spatio-temporal localization to visualize when and where physical artifacts occur in the generated content.