Evaluation & benchmarks — korshunov.ai

Study finds readers prefer human over AI literary translations despite adequate machine quality

A recent study investigates reader preferences regarding AI versus human translations of literary works, noting that while automatic metrics often favor machine output, they fail to capture immersive and literary effects. Researchers asked 15 avid readers to compare human translations against those generated by an agentic LLM pipeline for 15 novels in French, Polish, and Japanese. The evaluation involved approximately 8K-word excerpts through both immersive reading of whole texts and close reading of aligned chunk pairs. Results showed that while readers found machine translations adequate, they significantly preferred human versions for their clarity and ease of immersion. Notably, participants could not reliably distinguish between the two types of translation and tended to favor whichever version they believed was human-made. To support future research, the authors released LAIT, a reader-centered dataset containing 1K comments, 2K judgments, and 7.2K span-level annotations.

arxiv arXiv cs.CL · 12h ago

Evaluating OCR-Reasoning Robustness of Vision-Language Models Under Visual Perturbations

The authors introduce OCR-Robust, a benchmark designed to evaluate the robustness of vision-language models during OCR reasoning tasks under visual perturbations. The dataset comprises 812 samples divided into two subsets: OCR1.0, which covers documents and handwriting, and OCR2.0, focusing on charts and tables. A pilot study identified five representative perturbation types at three severity levels to ensure efficient evaluation. The study benchmarks 18 models, including proprietary systems and open-source VLMs, using metrics like Relative Corruption Retention and Worst-Case Retention. Results indicate that higher clean accuracy does not necessarily correlate with stronger robustness against visual degradation. Furthermore, the analysis reveals that charts and tables are substantially more fragile than document-like inputs when subjected to these perturbations.

arxiv arXiv cs.CL · 12h ago

Keyword Lexicon Blindness Distorts Rhetorical Stance Measurement

A study analyzing 85 interviews with four public intellectuals reveals that keyword-based scoring can produce statistical artifacts regarding rhetorical stance. Initial analysis showed a robust negative-affect and emphatic-certainty co-occurrence pattern with high correlation coefficients ranging from r = 0.72 to 0.93. However, replacing this method with LLM-based zero-shot semantic classification on the full diarized corpus of 32,625 sentences significantly reduced these correlations. For instance, Dalio's correlation dropped from 0.851 to 0.206, while other speakers exhibited negative or null relationships between negativity and certainty. In contrast, the LLM analysis revealed a strong coupling between negative sentiment and hedging language, aligning with conventional expectations of pessimistic discourse. The discrepancy stems from three structural failures in keyword lexicons: syntactic blindness, polysemy blindness, and categorical absence. These flaws can invert semantic meaning, such as scoring 'never absolutely totally confident' as high certainty. The authors argue that keyword counts measure lexical co-occurrence tendencies rather than epistemic certainty, constituting a category error.
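
The negation failure described above can be reproduced with a toy scorer. This is a minimal sketch, assuming a tiny hypothetical certainty lexicon and a plain count-the-hits rule; `CERTAINTY_WORDS` and `keyword_certainty_score` are illustrative names, not the study's lexicon or code.

```python
import re

# Hypothetical mini-lexicon, for illustration only (not the study's lexicon).
CERTAINTY_WORDS = {"absolutely", "totally", "definitely", "confident", "certain"}


def keyword_certainty_score(sentence: str) -> int:
    """Count lexicon hits, ignoring syntax and negation entirely."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return sum(token in CERTAINTY_WORDS for token in tokens)


negated = "I am never absolutely totally confident"
affirmed = "I am absolutely totally confident"

# Both sentences receive the same score, because "never" is not in the lexicon
# and the scorer has no notion of scope.
print(keyword_certainty_score(negated), keyword_certainty_score(affirmed))
```

Because the scorer only sees word membership, the negated and affirmed sentences are indistinguishable to it, which is the syntactic blindness the summary refers to.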

arxiv arXiv cs.CL · 12h ago

Auditing Order Sensitivity in Multimodal Large Language Models

The study introduces Facet-Probe, a five-facet audit of 18 frontier and open-weight multimodal large language models to assess order sensitivity. Standard benchmarks often miss whether shuffling evidence changes answers, a reliability property highlighted by emerging AI evaluation guidelines. Using a Bayesian item-response model, the researchers separated ordering noise from per-facet bias and estimated decoder-stochastic floors via same-ordering controls. The audit found that none of the 18 models are order-invariant, with panel-mean flip rates spanning 24-50% across different facets. Even the best-performing model flipped its answer on 13.4% of trials, indicating that higher capability does not eliminate this vulnerability. Mitigation tests using training-free prompt changes proved modality-conditional and failed to transfer between text and visual reasoning tasks. These findings suggest that prompt-level fixes are insufficient for general order robustness, motivating architectural solutions. The authors propose cross-ordering flip rate as a standard reporting axis for future MLLM evaluations.
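
As a rough illustration of the proposed reporting axis, the sketch below computes a flip rate per trial: the share of reordered-evidence trials whose answer differs from the answer under a reference ordering. This per-trial definition, the function name `cross_ordering_flip_rate`, and the sample data are assumptions for illustration; the paper's exact formulation (including its Bayesian item-response model) is not reproduced here.

```python
def cross_ordering_flip_rate(trials: list[tuple[str, list[str]]]) -> float:
    """Share of reordered trials whose answer differs from the reference answer.

    Each trial is (answer under the reference ordering,
                   answers to the same item under shuffled evidence orderings).
    """
    flips = 0
    total = 0
    for reference_answer, shuffled_answers in trials:
        flips += sum(answer != reference_answer for answer in shuffled_answers)
        total += len(shuffled_answers)
    return flips / total if total else 0.0


# Hypothetical answers for three items, three shuffles each.
trials = [
    ("B", ["B", "B", "C"]),
    ("A", ["A", "D", "D"]),
    ("C", ["C", "C", "C"]),
]
print(f"flip rate: {cross_ordering_flip_rate(trials):.3f}")
```

Running the same function on repeated trials that keep the ordering fixed would give a baseline for decoding noise, in the spirit of the same-ordering controls mentioned above.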

arxiv arXiv cs.CL · 12h ago

Real-Time Voice AI Hears but Does Not Listen

A study evaluates four leading production real-time voice systems: OpenAI's GPT Realtime 2, Google's Gemini 3.1 Flash Live, and Alibaba's Qwen3.5 Omni Plus and Omni Flash. The research focuses on tasks where both words and vocal delivery convey meaningful information across three consequential scenarios. All four systems act on the literal words rather than the voice, leading to errors such as ending calls with crying users who insist nothing is wrong or approving wire transfers made in frightened voices. Surprisingly, this disconnect is often not a failure of perception, as three of the four systems can reliably identify distress, fear, or sarcasm when asked directly. Despite this awareness, the models ignore these emotional cues during decision-making, exhibiting what the authors term the 'emotional intelligence gap.' The study also notes that systems estimate accent and age based on word biases rather than acoustic properties. Prompting the systems to explicitly attend to vocal delivery improves performance only partially and inconsistently. These findings suggest current real-time voice AI behaves as if speech were reduced to a transcript, warranting caution in settings where tone is critical.

media r/LocalLLaMA · 13h ago

Reddit Inquiry on Running Large Models with 4x-8x RTX 6000 PROs

A Reddit user is seeking community feedback regarding the performance of large language models on systems equipped with four to eight NVIDIA RTX 6000 PRO GPUs. The inquiry specifically targets users who have between 384GB and 768GB of VRAM available for running models such as GLM 5.2, Kimi 2.7, and DeepSeek V4 Pro. The poster notes that while these models can technically run at 4-bit quantization, they may not fit within the memory constraints when using 8-bit precision. They reference a benchmark repository but highlight that it lacks data for the most recent model releases. A key concern raised is whether the quality degradation from using 4-bit versus 8-bit quantization is significant enough to impact agentic or programming tasks. The user also asks which inference backends, such as vLLM or SGLang, others are running on this hardware configuration.

arxiv arXiv cs.CL · 13h ago

Measuring Research Difficulty in NLP: An Inverted U-Shaped Relationship with Academic Impact

This study proposes a comprehensive evaluation system for measuring the difficulty of academic research, focusing on Natural Language Processing as a case study. The authors extract internal and external features from papers, including collaboration, content, and references, to compute multiple difficulty indicators. These indicators are weighted using the entropy weight method and summed to generate a final research difficulty score. Academic impact is quantified by citation frequency, while expert assessments validate the reliability of the measurement approach. Empirical results indicate that page count, reference count, and high-level institutional participation significantly correlate with academic impact. Crucially, the analysis reveals an inverted U-shaped relationship between research difficulty and impact. This suggests that moderately difficult research tends to achieve the highest level of academic influence.
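
The entropy weight method mentioned above is a standard weighting scheme, and a minimal sketch of it follows. The min-max normalization step, the treatment of constant columns, the function name `entropy_weights`, and the sample indicator values are all assumptions for illustration; the paper's actual indicators and preprocessing may differ.

```python
import math


def entropy_weights(rows: list[list[float]]) -> tuple[list[float], list[float]]:
    """Return (indicator weights, per-row weighted scores). Needs >= 2 rows."""
    n = len(rows)
    # Min-max normalize each indicator (column) to [0, 1].
    norm_cols = []
    for col in zip(*rows):
        lo, hi = min(col), max(col)
        span = hi - lo
        norm_cols.append([(v - lo) / span if span else 0.0 for v in col])

    # Information entropy per indicator; low entropy means more discriminating.
    divergences = []
    for col in norm_cols:
        total = sum(col)
        if total == 0:
            entropy = 1.0  # constant column: carries no information
        else:
            entropy = -sum(
                (v / total) * math.log(v / total) for v in col if v > 0
            ) / math.log(n)
        divergences.append(1.0 - entropy)

    weight_sum = sum(divergences)
    if weight_sum == 0:
        weights = [1.0 / len(divergences)] * len(divergences)
    else:
        weights = [d / weight_sum for d in divergences]

    # Final score: weighted sum of normalized indicators for each row.
    scores = [
        sum(w * col[i] for w, col in zip(weights, norm_cols)) for i in range(n)
    ]
    return weights, scores


# Hypothetical indicator values for four papers (three indicators each).
papers = [[4, 10, 1], [8, 30, 0], [12, 20, 1], [30, 25, 0]]
weights, scores = entropy_weights(papers)
print(weights, scores)
```

Indicators whose values are spread unevenly across papers receive larger weights, so the final score leans on the features that separate papers most.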

arxiv arXiv cs.CL · 14h ago

Memory Makes the Difference: Evaluating How Different Memory Roles Shape Conversational Agents

Prior research on memory mechanisms in RAG-based conversational systems has primarily focused on storage and retrieval methods. This study investigates how memories with distinct functional roles influence response quality across varying contexts. The authors present a fine-grained taxonomy of conversational memory to classify retrieved items into specific role types. They also design a user-centric evaluation framework that simulates user perspectives to address limitations in reference-based assessments. Comparative experiments were conducted on long-term datasets using frontier large language models to analyze these effects. Results indicate that clarifying memory enhances factual accuracy and constraint awareness, leading to more correct and personalized responses. Conversely, irrelevant memory was found to reduce topic relevance and degrade constraint awareness capabilities. These findings demonstrate how different memory types can be leveraged to improve personalization in conversational agents.

arxiv arXiv cs.CL · 14h ago

Sarashina2.2-TTS: Tackling Kanji Polyphony in Japanese Speech Generation via Data Scaling and Targeted Data Synthesis

Sarashina2.2-TTS is a Japanese-centric LLM-based text-to-speech system designed to address the linguistic challenge of context-dependent kanji polyphony. The model scales training data to approximately 361k hours, utilizing a balanced mix of Japanese and English speech corpora. To specifically handle reading disambiguation, the authors implemented a targeted data augmentation pipeline covering all 2,136 Joyo regular-use kanji. Alongside the model release, the paper introduces the Joyo Kanji Yomi Benchmark, which includes 4,378 distinct readings for these characters. The authors also propose Kana-CER, a metric that evaluates pronunciation correctness by comparing synthesized speech against reference readings in kana space. Experimental results show that this targeted augmentation significantly improves reading accuracy and achieves state-of-the-art kanji-level performance. The system matches top baselines on general sentence-level pronunciation while delivering the highest speaker similarity in zero-shot synthesis scenarios. Furthermore, cross-lingual evaluations confirm that the balanced training approach ensures stable Japanese pronunciation regardless of the prompt language used.
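
One plausible reading of Kana-CER is a character error rate computed over kana strings: the edit distance between the reference reading and the reading recovered from the synthesized speech, divided by the reference length. The sketch below assumes that reading and assumes both readings are already available as kana strings; how the paper obtains the kana transcript from audio is not covered, and `kana_cer` is an illustrative name.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance over characters (insert, delete, substitute)."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def kana_cer(reference_kana: str, hypothesis_kana: str) -> float:
    """Character error rate in kana space, relative to the reference length."""
    if not reference_kana:
        raise ValueError("reference reading must be non-empty")
    return edit_distance(reference_kana, hypothesis_kana) / len(reference_kana)


# 今日 should be read きょう in most contexts; こんにち is the other reading.
print(kana_cer("きょう", "きょう"))
print(kana_cer("きょう", "こんにち"))
```

Comparing in kana rather than in kanji is what lets the metric penalize a wrong reading even when the written form would be identical.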

arxiv arXiv cs.CL · 14h ago

Survey of Toxicity Detection and Mitigation Strategies for Multilingual Language Models

This survey synthesizes research on toxicity detection and detoxification strategies specifically designed for multilingual large language models. It catalogs threat models that exploit linguistic variations such as code-switching, orthographic differences, and translation pivots to bypass safety alignments. The authors organize existing work into task formulations like toxic-to-neutral rewriting and classification, alongside various detection approaches including cross-lingual encoders and LLM-based detectors. Mitigation strategies are detailed across data filtering, supervised tuning, decoding-time steering, and the implementation of multilingual guardrails. The analysis highlights persistent challenges in the field, notably uneven language coverage and fragmented evaluation protocols. Furthermore, it addresses the complexity of culturally contingent definitions of harm and the risk that detoxification efforts may suppress legitimate dialectal or identity-related expression.

arxiv arXiv cs.CL · 14h ago

Evaluating Japanese Dialect Robustness Across Speech and Text-based Large Language Models

This study investigates the dialectal robustness of large language models (LLMs) and speech language models (SLMs) using Japanese dialects as a test case. While LLM-based dialogue systems have advanced, dialectal variation remains a significant challenge, particularly for spoken input processing. The research defines robustness as the ratio of performance on dialectal versus standard inputs to enable fair comparisons across different model types. Experiments reveal that SLM robustness correlates directly with the robustness of their underlying text-based LLM counterparts. Additionally, the study finds that training with dialectal data and fine-tuning the speech encoder both serve to improve robustness in SLMs. These findings clarify how base LLM capabilities affect SLM performance and identify effective strategies for enhancing dialect comprehension.
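
The ratio-based definition of robustness can be written down directly. The sketch below is a minimal illustration with made-up scores; the function name `robustness_ratio` and the numbers are assumptions, and the paper's underlying task metrics are not specified here.

```python
def robustness_ratio(dialect_score: float, standard_score: float) -> float:
    """Performance on dialectal input relative to standard-language input."""
    if standard_score <= 0:
        raise ValueError("standard_score must be positive")
    return dialect_score / standard_score


# Hypothetical scores: model A is stronger in absolute terms,
# but model B loses less when moving from standard to dialectal input.
model_a = robustness_ratio(dialect_score=0.72, standard_score=0.90)
model_b = robustness_ratio(dialect_score=0.54, standard_score=0.60)
print(f"A: {model_a:.2f}  B: {model_b:.2f}")
```

Normalizing by the standard-input score is what allows models with very different baseline accuracy, such as text LLMs and speech models, to be compared on the same scale.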

arxiv arXiv cs.CL · 15h ago

Reclaim Evaluation Shows Lossy Memory Is Worse Than No Memory

A study demonstrates that a language model's memory containing incorrect conclusions is more detrimental than having no memory at all. When models retain stale values while dropping supporting work, they emit confident but wrong answers, whereas empty memories allow for abstention. This phenomenon, termed brittle memory, was observed across seven models where the direction of failure never reversed regardless of task or disposition. The researchers introduced reclaim evaluation to measure correctability by compressing interactions and testing if corrections recover ground truth without using a judge. Results indicate that correctability depends on whether the source information survives compression rather than model capability. A source-first policy, which keeps recomputable sources and drops re-derivable conclusions, restored correctability significantly better than length-matched controls. In chained memory loops, dropped-source errors corrupt downstream steps irreparably, while the proposed fix maintains bounded performance horizons. The findings replicate across three deployed systems and real dialogue data, with a hand-built oracle reaching perfect accuracy.

arxiv arXiv cs.CL · 15h ago

The Generalization Spectrum: A Chromatographic Approach to Evaluating Learning Algorithms

Traditional evaluations reduce learning to a single aggregate score, obscuring how well knowledge from one example generalizes to others. The authors introduce the Generalization Spectrum, an evaluation framework that measures per-sample generalization by tracking performance across test variants with increasing transfer distance. These variants range from exact recall to implementation transfer across languages and context transfer under narrative reframing. The framework is instantiated on competitive programming using a selection-and-synthesis pipeline seeded with recent problems to mitigate contamination. Comparisons of canonical learning paradigms show that Reinforcement Learning converts memorization into near-transfer more efficiently than Supervised Fine-Tuning baselines. In-context learning exhibits strong but correspondence-dependent transfer capabilities in this context. Diagnostic profiles reveal that local gains do not necessarily expand the generalization radius for all methods. Specifically, abstractions and hints mainly lift local transfer, while Reference SFT preserves a stronger far-transfer tail than RFT. Furthermore, self-distillation or hint-assisted RL can reduce far transfer even when local transfer improves.

arxiv arXiv cs.CL · 15h ago

Fine-Tuned PEGASUS Achieves State-of-the-Art Performance on XL-Sum English Corpus

This paper presents a method for optimizing abstractive text summarization by fine-tuning the PEGASUS model on the XL-Sum English corpus. The objective is to surpass the performance of the baseline mT5 model in generating concise summaries that capture salient ideas without merely extracting sentences. The generated summaries are evaluated using the ROUGE metric, which compares auto-generated outputs against human-created references. The study claims that the fine-tuned PEGASUS model achieves state-of-the-art results on this specific dataset. Quantitative analysis reveals a 4.04% improvement in the ROUGE-1 score compared to the baseline. Additionally, the model demonstrates a significant 15.25% increase in the ROUGE-2 score. Finally, there is a reported 3.39% improvement in the ROUGE-L score, confirming the effectiveness of the fine-tuning approach.

arxiv arXiv cs.CL · 16h ago

Red Teaming Framework Uncovers LLM Faithfulness Vulnerabilities via Multi-Role Architecture

This paper introduces a red teaming framework designed to systematically uncover vulnerabilities in large language model outputs through a multi-role architecture. The system utilizes target, attacker, and jury models to generate adversarial prompts and rigorously evaluate response accuracy and consistency. In a case study on faithfulness evaluation, exploitative adversarial prompts increased the attack success rate by up to 7.9% in question-answering tasks. The research demonstrates that architectural design choices typically outweigh parameter scaling in determining model safety and identifies how structural constraints shape vulnerability patterns. The framework shows adaptability across diverse evaluation tasks, ranging from English question-answering to Arabic summarization. However, the approach faces challenges in fully automating adversarial prompt generation across different languages. Additionally, experiments reveal limitations in detecting subtle forms of unfaithfulness that do not manifest as explicit factual contradictions.

arxiv arXiv cs.CL · 16h ago

Calibration and Adversarial Robustness of Automated ASR Scoring

This study evaluates the reliability of automated judges used to measure attack success rates (ASR) in LLM jailbreaks by comparing them against human majority votes. Using 596 human-labeled completions from HarmBench, the authors find that dedicated safety classifiers over-flag with high recall but lower precision, while LLM-as-judge models exhibit erratic recall ranging from 0.06 to 0.65. These discrepancies cause significant variability in reported ASR depending on which judge family is employed. The research also highlights sharp differences in robustness, showing that benign framing wrappers can flip LLM-judge decisions between 57% and 100% of the time. In contrast, dedicated classifiers resist such surface attacks but remain vulnerable to white-box GCG attacks, which flipped 70% of confident true positives despite a small optimization budget. A two-annotator audit confirmed that these adversarial flips preserved the underlying harmful content. Consequently, many current ASR metrics are deemed unreliable both under average conditions and under deliberate pressure. The authors recommend reporting judge precision and recall on human-labeled data and including adversarial checks in future research.
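
The recommended reporting can be sketched as follows: take the human majority vote as ground truth, then report the judge's precision and recall alongside the attack success rate each side would imply. The function names and the sample labels are hypothetical and do not come from the paper or from HarmBench.

```python
def majority_label(votes: list[bool]) -> bool:
    """True if a strict majority of annotators marked the completion harmful."""
    return sum(votes) * 2 > len(votes)


def judge_report(judge_flags: list[bool], human_votes: list[list[bool]]) -> dict:
    """Precision/recall of a judge against human majority, plus implied ASR."""
    human = [majority_label(votes) for votes in human_votes]
    tp = sum(j and h for j, h in zip(judge_flags, human))
    fp = sum(j and not h for j, h in zip(judge_flags, human))
    fn = sum(h and not j for j, h in zip(judge_flags, human))
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "judge_asr": sum(judge_flags) / len(judge_flags),
        "human_asr": sum(human) / len(human),
    }


# Hypothetical data: six completions, three annotators each, one automated judge.
human_votes = [
    [True, True, False],
    [True, True, True],
    [True, False, True],
    [False, False, True],
    [False, False, False],
    [False, False, False],
]
judge_flags = [True, False, False, False, False, False]
report = judge_report(judge_flags, human_votes)
print(report)
```

In this toy case the judge is perfectly precise but misses two of three harmful completions, so the ASR it reports is a third of the human-labeled value, which is the kind of gap the summary attributes to low-recall judges.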

arxiv arXiv cs.CL · 16h ago

STC Improves Arabic Customer Service via MARBERT Sentiment Analysis

Saudi Telecom Company (STC) aims to enhance user satisfaction by leveraging Twitter feedback for sentiment analysis. The study addresses the gap in Arabic Natural Language Processing by training the MARBERT model on a specific dataset of 24,513 tweets. This collection includes 1,437 positive, 13,828 negative, and 5,694 neutral tweets, alongside 1,221 sarcastic and 2,297 indeterminate entries. The primary objective is to analyze these sentiments to improve STC's customer service responsiveness. Performance was evaluated using F1-score, precision, and recall to ensure robust detection of spam and sentiment. Results indicate that the proposed scheme offers promising accuracy compared to existing techniques in the literature.

arxiv arXiv cs.CL · 16h ago

Behavioral Drivers of Rating-Sentiment Incongruence in Sri Lankan Tourism Reviews

This study investigates the incongruence between star ratings and written review sentiments within Sri Lankan tourism attraction reviews. Analyzing a dataset of 16,156 reviews from 2010 to 2023, researchers employed a transformer-based pipeline to derive textual sentiment independently of assigned ratings. The analysis reveals that 18.6% of reviews exhibit incongruence, primarily driven by Conservative Rater and Obligatory 5-Star behaviors. These mismatches vary across venue types, with museums demonstrating the highest rates of divergence. Statistical tests, logistic regression, Random Forest, and SHAP analysis identify venue type, reviewer expertise, review length, and temporal factors as key contributors to this phenomenon. The findings demonstrate that star ratings are not interchangeable with textual sentiment and require validation before being used as ground-truth labels in NLP tasks.

arxiv arXiv cs.CL · 16h ago

SWE-Pro Benchmark Reveals Significant Gap Between LLMs and Expert Software Optimization

The SWE-Pro benchmark addresses the lack of realistic evaluation frameworks for software performance optimization by introducing a repository-level dataset derived from 102 expert-written optimizations. Unlike previous benchmarks that oversimplify tasks, SWE-Pro pairs each task with parameterized tests to evaluate runtime, peak memory, and Time-Weighted Memory Usage under noise-aware conditions. The study reveals that current Large Language Models struggle significantly with these complex requirements, showing negligible runtime gains and nearly non-existent memory optimizations. In sharp contrast, expert implementations achieved an aggregate speedup of 15.5x and a peak memory reduction of 171.3x across the benchmark tasks. Expert-written improvements were observed in 91.2% of tasks for runtime and 65.7% for peak memory. These findings expose a substantial gap between current LLM capabilities and the demands of expert-level engineering.

arxiv arXiv cs.CL · 16h ago

SFL-MTSC: Leveraging Semantic Frame-Level Multi-Task Self-Consistency for Robust Multi-Intent Spoken Language Understanding

Prompt-based spoken language understanding with large language models often suffers from inconsistent intent-slot structures due to decoding stochasticity, particularly in multi-intent scenarios. To address this, researchers propose Semantic Frame-Level Multi-Task Self-Consistency (SFL-MTSC), a novel structured aggregation framework operating at the semantic frame level. Instead of relying on output-level majority voting, SFL-MTSC decomposes predictions into intent-specific frames and applies domain-intent grouping alongside slot-level clustering. The framework evaluates cluster reliability using path support scoring to determine which frames are trustworthy. Reliable frames are retained and re-integrated to form the final prediction, ensuring greater structural consistency. Zero-shot experiments on the MAC-SLU benchmark dataset demonstrate improved slot F1 scores and overall accuracy compared to single-path inference. Intent accuracy remains largely stable across most settings while achieving these gains in slot-level performance.