Evaluation & benchmarks — korshunov.ai


AI Exposure Scores: Limitations of Static Metrics and the Need for Research-Policy Coordination

Eloundou et al. (2023) define AI exposure as the share of occupational tasks that large language models can assist with, and their exposure scores have become a central input in future-of-work debates. These static measures have temporal, geographic, and ontological limitations, and those caveats often fail to travel with the scores into policy analyses. The authors identify two primary gaps: structural mismatches between static scores and dynamic policy needs, and insufficient coordination between researchers and policymakers. To address measurement limits, the article surveys five research families: dynamic benchmarks, ensemble methods, task-framework extensions, worker-centered metrics, and adoption data. The second gap requires deliberate political work to reimagine future outcomes rather than relying solely on better measurement. Policymakers must widen their evidence base, engage workers as partners, and shift from prediction to preparedness. Researchers are urged to build data infrastructure, adopt participatory methods, and write with policymakers in mind.

arxiv arXiv cs.AI · 12h ago

PsyBridge: A Hybrid Framework for Multi-Dimensional Mental Health Assessment

The study introduces PsyBridge, a hybrid intelligent framework designed to address the limitations of isolated screening instruments in mental health assessment. This system integrates clinically validated tools like PHQ-9 and GAD-7 with cognitive evaluation and personality profiling within a unified architecture. A modular design employing a weighted aggregation mechanism generates interpretable risk classifications and recommendations for users. To evaluate performance, researchers constructed a semi-synthetic dataset comprising 500 patient profiles based on clinically grounded score distributions. Experimental results show that PsyBridge achieves an overall accuracy of 0.84, outperforming standalone PHQ-9 and GAD-7 assessments. The framework also demonstrates improvements in precision, recall, and F1-score compared to existing methods. Sensitivity analysis confirms that integrating cognitive and personality components stabilizes classification performance and reduces prediction inconsistencies. These findings suggest PsyBridge offers a scalable approach for AI-assisted decision support in digital healthcare environments.

arxiv arXiv cs.LG · 13h ago

Entropy-Guided Boundary Supervision for Breast Ultrasound Segmentation

This study introduces an entropy-guided boundary supervision method to address boundary leakage and false-positive activations in breast ultrasound segmentation. The proposed loss function scales contour penalties by per-pixel predictive entropy and ground-truth maps, focusing gradient emphasis on uncertain lesion margins. Evaluated on the BUSI dataset, the method preserved lesion segmentation quality with a mean Dice score of 0.7624, statistically indistinguishable from the baseline. However, it significantly improved specificity by reducing false-positive activations on no-lesion images from 19 of 20 to 5 of 20. A post-hoc spatial temperature scaling step further reduced the expected calibration error from 0.0201 to 0.0095 without altering segmentation masks. These results demonstrate that entropy-guided supervision and spatial calibration function as complementary refinements within a U-Net framework.
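
The summary does not give the exact loss, so the following is only a minimal NumPy sketch of the general idea: a per-pixel cross-entropy term is scaled by a boundary map derived from the ground truth and by the model's own predictive entropy. The function names, the 4-neighbourhood boundary definition, and the normalisation are all assumptions made for illustration, not the paper's formulation.

```python
import numpy as np

def binary_entropy(p, eps=1e-7):
    """Per-pixel predictive entropy of a foreground probability map, in nats."""
    p = np.clip(p, eps, 1 - eps)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def boundary_map(mask):
    """1.0 where a pixel's 4-neighbourhood contains a different label (assumed contour definition)."""
    m = mask.astype(bool)
    padded = np.pad(m, 1, mode="edge")
    center = padded[1:-1, 1:-1]
    differs = (
        (center != padded[:-2, 1:-1]) | (center != padded[2:, 1:-1])
        | (center != padded[1:-1, :-2]) | (center != padded[1:-1, 2:])
    )
    return differs.astype(float)

def entropy_weighted_boundary_loss(prob, mask, eps=1e-7):
    """Cross-entropy averaged over contour pixels, weighted by normalised entropy."""
    p = np.clip(prob, eps, 1 - eps)
    bce = -(mask * np.log(p) + (1 - mask) * np.log(1 - p))
    # Entropy is divided by log(2) so the weight lies in [0, 1].
    weight = boundary_map(mask) * binary_entropy(p) / np.log(2)
    return float((weight * bce).sum() / (weight.sum() + eps))

# Toy 8x8 example: a square "lesion" and a maximally uncertain prediction.
mask = np.zeros((8, 8))
mask[2:6, 2:6] = 1.0
uncertain = np.full((8, 8), 0.5)
print(entropy_weighted_boundary_loss(uncertain, mask))
```

In this sketch an image with no lesion has an empty boundary map, so the term contributes nothing there; how the actual method handles no-lesion images is not described in the summary.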

arxiv arXiv cs.LG · 13h ago

Diffusion Integrated Gradients: Controllable Path Generation for Flexible Feature Attribution

The authors propose Diffusion Integrated Gradients (DiffIG), a novel method that reformulates path generation as a conditional generative modeling problem to address limitations in existing attribution techniques. While integrated gradients are widely used, their reliance on fixed or hand-crafted paths often results in noisy or distorted attributions. To solve this, DiffIG trains a diffusion model to learn a distribution over paths derived from a Stick-Breaking Process. The method then employs guided sampling to allow for the embedding of user guidance during the inference-time sampling procedure. This approach enables flexible and controllable feature attribution by treating path selection as a generative task rather than a static choice. Experimental results demonstrate that DiffIG quantitatively matches or outperforms existing path-based methods in terms of attribution quality. Furthermore, the generated explanations are shown to be perceptually aligned with human expectations. The work introduces a new generative perspective for Explainable Artificial Intelligence that supports dynamic control over explanation paths.
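
To make the role of the path concrete, here is a small sketch of path-based integrated gradients using a midpoint Riemann sum. It does not implement DiffIG's diffusion model or guided sampling; the toy function F(x) = x1 * x2, its analytic gradient, and the two hand-written paths are assumptions chosen only to show that different paths between the same endpoints can yield different attributions.

```python
import numpy as np

def path_integrated_gradients(grad_fn, path):
    """Approximate the line integral of the gradient along a discretised path.

    path: array of shape (steps + 1, d) running from the baseline to the input.
    Returns one attribution value per feature.
    """
    path = np.asarray(path, dtype=float)
    deltas = np.diff(path, axis=0)                 # segment displacements
    midpoints = 0.5 * (path[1:] + path[:-1])       # midpoint rule
    grads = np.stack([grad_fn(p) for p in midpoints])
    return (grads * deltas).sum(axis=0)

# Toy model F(x) = x1 * x2 with gradient (x2, x1).
grad_f = lambda x: np.array([x[1], x[0]])

baseline = np.array([0.0, 0.0])
x = np.array([1.0, 2.0])

# Straight-line path, as in standard integrated gradients.
alphas = np.linspace(0.0, 1.0, 51)[:, None]
straight = baseline + alphas * (x - baseline)

# Axis-aligned path: move along x1 first, then along x2.
axis_aligned = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 2.0]])

print(path_integrated_gradients(grad_f, straight))      # approximately [1. 1.]
print(path_integrated_gradients(grad_f, axis_aligned))  # approximately [0. 2.]
```

Both attributions sum to F(x) - F(baseline) = 2, but they divide the credit between the two features differently, which is the degree of freedom a learned path distribution would control.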

media r/LocalLLaMA · 13h ago

User Reports Strong Performance of siq1 Model on Kebab Bench

A Reddit user has shared results indicating that their model, referred to as siq1, performs very well on the Kebab Bench evaluation. The post highlights the model's capabilities through a demonstration hosted on Hugging Face Spaces. Specifically, the user points to a space titled 'hermes-agent-zerogpu' created by AlexWortega as evidence of this performance. The post was submitted by /u/Mysterious_Hearing14 in the r/LocalLLaMA community. It includes a link to the Hugging Face interface where the model can be tested, and a video demonstration is available via a v.redd.it link.

arxiv arXiv cs.LG · 14h ago

Null-Calibrated Conformal Selection via Target-Membership Scores

The article introduces Null-Calibrated Conformal Selection (NCCS), a method that utilizes target-membership probability scores to identify test candidates within a target region while controlling the false discovery rate. The authors argue that these membership scores provide a more natural ranking for selection tasks than conventional prediction-oriented nonconformity scores, particularly for complex targets. This distinction is critical for interval-valued, variance-driven, multimodal, or multi-condition targets where traditional scores may be misaligned with selection power. NCCS ranks test scores against confirmed non-target calibration examples to yield finite-sample valid null p-values under null exchangeability. These p-values can be combined with the Benjamini-Yekutieli procedure under arbitrary dependence or the Benjamini-Hochberg procedure under standard positive-dependence conditions. Experiments demonstrate that membership scores match conventional scores on mean-monotone targets but substantially improve performance on variance-driven targets. In rare-target regimes, NCCS trades power for finite-sample null validity, addressing issues where direct empirical-FDP thresholding can be anti-conservative.
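
The two ingredients named above can be sketched with standard formulas: a conformal p-value that ranks each test score against null (non-target) calibration scores, followed by a step-up multiple-testing procedure. This is a generic illustration under the assumption that a higher membership score means "more likely in the target region"; the function names and toy data are hypothetical, and the paper's exact scoring and tie-handling may differ.

```python
import numpy as np

def null_conformal_pvalues(null_scores, test_scores):
    """p_j = (1 + #{null scores >= test score j}) / (n + 1)."""
    null_scores = np.asarray(null_scores, dtype=float)
    test_scores = np.asarray(test_scores, dtype=float)
    n = null_scores.size
    at_least = (null_scores[None, :] >= test_scores[:, None]).sum(axis=1)
    return (1 + at_least) / (n + 1)

def step_up_select(pvals, q, arbitrary_dependence=False):
    """Benjamini-Hochberg selection at level q.

    With arbitrary_dependence=True, q is divided by the harmonic sum
    1 + 1/2 + ... + 1/m, which gives the Benjamini-Yekutieli procedure.
    Returns the sorted indices of selected candidates.
    """
    p = np.asarray(pvals, dtype=float)
    m = p.size
    if arbitrary_dependence:
        q = q / np.sum(1.0 / np.arange(1, m + 1))
    order = np.argsort(p)
    thresholds = q * np.arange(1, m + 1) / m
    passed = np.nonzero(p[order] <= thresholds)[0]
    if passed.size == 0:
        return np.array([], dtype=int)
    k = passed[-1] + 1                  # largest rank meeting its threshold
    return np.sort(order[:k])

# Toy data: 99 confirmed non-target calibration scores and 4 test candidates.
null_scores = np.linspace(0.0, 0.5, 99)
test_scores = [0.90, 0.95, 0.10, 0.99]
pvals = null_conformal_pvalues(null_scores, test_scores)
print(pvals)
print(step_up_select(pvals, q=0.1))
print(step_up_select(pvals, q=0.1, arbitrary_dependence=True))
```

Note that the smallest attainable p-value is 1 / (n + 1), so with few confirmed non-target calibration examples no candidate can clear a strict threshold. This is one way to read the power-for-validity trade-off mentioned for rare-target regimes.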

arxiv arXiv cs.LG · 14h ago

RoboMME-Interference Benchmarks Robot Memory Under Distraction

The introduction of RoboMME-Interference addresses the need for evaluating robot memory in realistic, long-context scenarios where systems must recall information from multiple sessions ago. This new cross-session benchmark is built upon the existing RoboMME framework to measure performance when robots face distractions from unrelated prior experiences. For each query episode, the benchmark constructs a session history consisting of relevant demonstrations followed by a controlled number of unrelated sessions provided as memory to Vision-Language-Action models. Researchers tested released memory-augmented variants of the π_0.5 model without modification to assess their robustness under these conditions. The results indicate that while perceptual memory variants improve success rates when no distractors are present, their accuracy decays steadily and strongly as unrelated sessions accumulate. These findings highlight a critical failure in current systems regarding long-context memory and interference resistance. The project page, videos, code, and data for this benchmark are available at https://robotmemorybench.com.

arxiv arXiv cs.LG · 14h ago

Flow Annealing Posterior Sampling for Function-Space Regression and Inverse Problems

The authors introduce Flow Annealing Posterior Sampling (FAPS), a novel framework that unifies stochastic-process regression with PDE inverse problems in function space. Built upon pretrained function-space flow-matching priors, FAPS facilitates likelihood-guided posterior inference using sparse and noisy observations. The method supports variable query discretizations and avoids the need for explicit prior-density evaluation during sampling. It employs a Langevin correction mechanism that utilizes a low-rank covariance preconditioner to exploit dominant function-space correlations across different discretizations. Benchmarks on both Gaussian and non-Gaussian stochastic processes demonstrate that FAPS produces coherent posterior samples with accurate uncertainty quantification. The approach significantly outperforms existing functional regression baselines in these standard tasks. Furthermore, it achieves competitive or superior performance in noisy PDE inverse problems compared to diffusion-based samplers while reducing test-time sampling costs.

media r/LocalLLaMA · 15h ago

User Reports Inferior Quality and Efficiency with MTP Models in Qwen 3.6 and Gemma 4

A user testing self-hosted Qwen 3.6 27B and Gemma 4 models on four RTX 5070 Ti cards reports that Multi-Token Prediction (MTP) degrades output quality compared to non-MTP variants. In code review tasks, the non-MTP model produced more detailed findings with fix suggestions while consuming fewer tokens than its MTP counterpart. The non-MTP setup achieved approximately 2000 tokens per second for prompt processing (pp) and 50-60 tokens per second for token generation (tg). Conversely, the MTP configuration yielded higher generation speeds of 100-120 tg/s but lower prompt processing rates of around 1300 pp/s. Despite the higher generation throughput, real-world agent task completion times were only about 20% faster with MTP due to increased context consumption. The user ran llama.cpp with specific GGUF files from Unsloth and noted similar negative experiences when testing Gemma 4.

arxiv arXiv cs.CL · 15h ago

HIPE-2026: Person-Place Relation Extraction from Multilingual Historical Texts

The HIPE-2026 campaign addresses the challenge of extracting person-place relations from noisy, multilingual historical documents. Moving beyond previous editions focused on named entity recognition, this third iteration targets temporally grounded relationships labeled as 'at' and 'isAt'. The evaluation involved 17 participating teams processing data in French, German, and English across three distinct datasets. These datasets comprised nineteenth and twentieth-century newspaper text alongside a surprise domain set of early modern French literary works. A key feature of the campaign was its three-fold framework assessing predictive accuracy, computational efficiency, and cross-domain generalization. Results from over 40 submitted runs demonstrated a wide variety of strategies, ranging from large language models to lightweight classifiers. The findings highlight the inherent trade-offs between accuracy, efficiency, and robustness in large-scale historical relation extraction.

arxiv arXiv cs.CL · 16h ago

SpeechEQ: Benchmarking Emotional Intelligence in Socially Aware Voice Conversational Models

The authors introduce SpeechEQ, a comprehensive framework designed to evaluate the sociolinguistic reasoning of Speech-Language Models. Existing evaluations often overlook the complex cross-modal reasoning required for active dialogue by relying on isolated text or passive acoustic perception. The framework includes a validated dataset of 2,265 dialogues across 15 Emotional Quotient subscales grounded in EQ-i 2.0 theory. It also features a multi-turn evaluation protocol measured by the proposed Spoken EQ score, which is inspired by human EQ assessments. Experiments reveal limitations in how both Speech Emotion Recognition and end-to-end models understand paralinguistic cues through speech. While end-to-end architectures outperform cascaded systems, current multimodal models remain bottlenecked by several specific issues. These barriers include a text-reliant modality shortcut, an alignment-induced safety trap, and contextual amnesia.

arxiv arXiv cs.CL · 16h ago

Study finds readers prefer human over AI literary translations despite adequate machine quality

A recent study investigates reader preferences regarding AI versus human translations of literary works, noting that while automatic metrics often favor machine output, they fail to capture immersive and literary effects. Researchers asked 15 avid readers to compare human translations against those generated by an agentic LLM pipeline for 15 novels in French, Polish, and Japanese. The evaluation involved approximately 8K-word excerpts through both immersive reading of whole texts and close reading of aligned chunk pairs. Results showed that while readers found machine translations adequate, they significantly preferred human versions for their clarity and ease of immersion. Notably, participants could not reliably distinguish between the two types of translation and tended to favor whichever version they believed was human-made. To support future research, the authors released LAIT, a reader-centered dataset containing 1K comments, 2K judgments, and 7.2K span-level annotations.

arxiv arXiv cs.CL · 17h ago

Evaluating OCR-Reasoning Robustness of Vision-Language Models Under Visual Perturbations

The authors introduce OCR-Robust, a benchmark designed to evaluate the robustness of vision-language models during OCR reasoning tasks under visual perturbations. The dataset comprises 812 samples divided into two subsets: OCR1.0, which covers documents and handwriting, and OCR2.0, focusing on charts and tables. A pilot study identified five representative perturbation types at three severity levels to ensure efficient evaluation. The study benchmarks 18 models, including proprietary systems and open-source VLMs, using metrics like Relative Corruption Retention and Worst-Case Retention. Results indicate that higher clean accuracy does not necessarily correlate with stronger robustness against visual degradation. Furthermore, the analysis reveals that charts and tables are substantially more fragile than document-like inputs when subjected to these perturbations.

arxiv arXiv cs.CL · 17h ago

Keyword Lexicon Blindness Distorts Rhetorical Stance Measurement

A study analyzing 85 interviews with four public intellectuals reveals that keyword-based scoring can produce statistical artifacts regarding rhetorical stance. Initial analysis showed a robust negative-affect and emphatic-certainty co-occurrence pattern with high correlation coefficients ranging from r = 0.72 to 0.93. However, replacing this method with LLM-based zero-shot semantic classification on the full diarized corpus of 32,625 sentences significantly reduced these correlations. For instance, Dalio's correlation dropped from 0.851 to 0.206, while other speakers exhibited negative or null relationships between negativity and certainty. In contrast, the LLM analysis revealed a strong coupling between negative sentiment and hedging language, aligning with conventional expectations of pessimistic discourse. The discrepancy stems from three structural failures in keyword lexicons: syntactic blindness, polysemy blindness, and categorical absence. These flaws can invert semantic meaning, such as scoring 'never absolutely totally confident' as high certainty. The authors argue that keyword counts measure lexical co-occurrence tendencies rather than epistemic certainty, constituting a category error.
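
The inversion described above is easy to reproduce with a toy keyword scorer. The lexicon below is hypothetical and far smaller than anything used in the study; the sketch only illustrates why counting certainty words, with no awareness of syntax or negation, can score a hedged or negated statement as highly certain.

```python
import re

# Hypothetical certainty lexicon, for illustration only.
CERTAINTY_LEXICON = {"absolutely", "totally", "confident", "certainly", "definitely"}

def keyword_certainty(sentence, lexicon=CERTAINTY_LEXICON):
    """Count lexicon hits in a sentence, ignoring word order and negation."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return sum(token in lexicon for token in tokens)

print(keyword_certainty("never absolutely totally confident"))  # 3
print(keyword_certainty("I am absolutely confident"))           # 2
```

The negated phrase receives the higher score because "never" is not in the lexicon and the scorer has no notion of scope, which corresponds to the syntactic blindness the authors describe.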

arxiv arXiv cs.CL · 17h ago

Auditing Order Sensitivity in Multimodal Large Language Models

The study introduces Facet-Probe, a five-facet audit of 18 frontier and open-weight multimodal large language models to assess order sensitivity. Standard benchmarks often miss whether shuffling evidence changes answers, a reliability property highlighted by emerging AI evaluation guidelines. Using a Bayesian item-response model, the researchers separated ordering noise from per-facet bias and estimated decoder-stochastic floors via same-ordering controls. The audit found that none of the 18 models are order-invariant, with panel-mean flip rates spanning 24-50% across different facets. Even the best-performing model flipped its answer on 13.4% of trials, indicating that higher capability does not eliminate this vulnerability. Mitigation tests using training-free prompt changes proved modality-conditional and failed to transfer between text and visual reasoning tasks. These findings suggest that prompt-level fixes are insufficient for general order robustness, motivating architectural solutions. The authors propose cross-ordering flip rate as a standard reporting axis for future MLLM evaluations.
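
As a rough illustration of what a cross-ordering flip rate could look like in practice, the sketch below counts the share of reordered trials whose answer differs from the answer given under a reference ordering. This definition, the function name, and the data layout are assumptions; the summary does not specify how the authors compute the metric, and the Bayesian item-response model is not reproduced here.

```python
def flip_rate(reference_answers, reordered_answers):
    """Fraction of reordered trials whose answer differs from the reference answer.

    reference_answers: dict mapping item id -> answer under the reference ordering.
    reordered_answers: dict mapping item id -> list of answers under other orderings.
    """
    flips = 0
    trials = 0
    for item_id, base in reference_answers.items():
        for answer in reordered_answers.get(item_id, []):
            trials += 1
            flips += answer != base
    return flips / trials if trials else 0.0

# Hypothetical answers for two items, each queried under three shuffled orderings.
reference = {"q1": "A", "q2": "B"}
shuffled = {"q1": ["A", "C", "A"], "q2": ["B", "B", "D"]}
print(flip_rate(reference, shuffled))
```

Passing repeated runs with an unchanged ordering as the second argument gives a comparable rate for sampling noise alone, which is the role the same-ordering controls play in the study.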

arxiv arXiv cs.CL · 17h ago

Real-Time Voice AI Hears but Does Not Listen

A study evaluates four leading production real-time voice systems: OpenAI's GPT Realtime 2, Google's Gemini 3.1 Flash Live, and Alibaba's Qwen3.5 Omni Plus and Omni Flash. The research focuses on tasks where both words and vocal delivery convey meaningful information across three consequential scenarios. All four systems act on the literal words rather than the voice, leading to errors such as ending calls with crying users who insist nothing is wrong or approving wire transfers made in frightened voices. Surprisingly, this disconnect is often not a failure of perception, as three of the four systems can reliably identify distress, fear, or sarcasm when asked directly. Despite this awareness, the models ignore these emotional cues during decision-making, exhibiting what the authors term the 'emotional intelligence gap.' The study also notes that systems estimate accent and age based on word biases rather than acoustic properties. Prompting the systems to explicitly attend to vocal delivery improves performance only partially and inconsistently. These findings suggest current real-time voice AI behaves as if speech were reduced to a transcript, warranting caution in settings where tone is critical.

media r/LocalLLaMA · 18h ago

Reddit Inquiry on Running Large Models with 4x-8x RTX 6000 PROs

A Reddit user is seeking community feedback regarding the performance of large language models on systems equipped with four to eight NVIDIA RTX 6000 PRO GPUs. The inquiry specifically targets users who have between 384GB and 768GB of VRAM available for running models such as GLM 5.2, Kimi 2.7, and DeepSeek V4 Pro. The poster notes that while these models can technically run at 4-bit quantization, they may not fit within the memory constraints when using 8-bit precision. They reference a benchmark repository but highlight that it lacks data for the most recent model releases. A key concern raised is whether the performance degradation from using 4-bit versus 8-bit quantization is significant enough to impact agentic or programming tasks. The user also asks which inference backends, such as vLLM or SGLang, are currently being utilized by others in this hardware configuration.

arxiv arXiv cs.CL · 18h ago

Measuring Research Difficulty in NLP: An Inverted U-Shaped Relationship with Academic Impact

This study proposes a comprehensive evaluation system for measuring the difficulty of academic research, focusing on Natural Language Processing as a case study. The authors extract internal and external features from papers, including collaboration, content, and references, to compute multiple difficulty indicators. These indicators are weighted using the entropy weight method and summed to generate a final research difficulty score. Academic impact is quantified by citation frequency, while expert assessments validate the reliability of the measurement approach. Empirical results indicate that page count, reference count, and high-level institutional participation significantly correlate with academic impact. Crucially, the analysis reveals an inverted U-shaped relationship between research difficulty and impact. This suggests that moderately difficult research tends to achieve the highest level of academic influence.
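
The entropy weight method mentioned above follows a standard recipe: min-max normalise each indicator, convert each column to proportions, compute its Shannon entropy, and give more weight to indicators with lower entropy (more variation across papers). The sketch below applies that recipe to a made-up matrix; the indicator values are hypothetical, and the paper's preprocessing may differ.

```python
import numpy as np

def entropy_weights(X):
    """Return (weights, normalised matrix) for an n_samples x n_indicators array.

    Assumes larger indicator values mean greater difficulty and that at
    least one indicator varies across samples.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    span = X.max(axis=0) - X.min(axis=0)
    Z = (X - X.min(axis=0)) / np.where(span == 0, 1.0, span)   # min-max scaling
    col_sum = Z.sum(axis=0)
    # Constant columns carry no information: treat them as uniform (entropy 1).
    P = np.where(col_sum == 0, 1.0 / n, Z / np.where(col_sum == 0, 1.0, col_sum))
    logP = np.log(P, where=P > 0, out=np.zeros_like(P))        # 0 * log 0 := 0
    entropy = -(P * logP).sum(axis=0) / np.log(n)
    divergence = 1.0 - entropy
    return divergence / divergence.sum(), Z

# Hypothetical indicators for four papers (columns: two difficulty indicators).
X = [[0, 0],
     [0, 1],
     [0, 2],
     [1, 3]]
weights, Z = entropy_weights(X)
scores = Z @ weights        # weighted sum = final difficulty score per paper
print(weights)
print(scores)
```

In this toy matrix the first indicator is concentrated in a single paper, so it has lower entropy and receives the larger weight.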

arxiv arXiv cs.CL · 19h ago

Memory Makes the Difference: Evaluating How Different Memory Roles Shape Conversational Agents

Prior research on memory mechanisms in RAG-based conversational systems has primarily focused on storage and retrieval methods. This study investigates how memories with distinct functional roles influence response quality across varying contexts. The authors present a fine-grained taxonomy of conversational memory to classify retrieved items into specific role types. They also design a user-centric evaluation framework that simulates user perspectives to address limitations in reference-based assessments. Comparative experiments were conducted on long-term datasets using frontier large language models to analyze these effects. Results indicate that clarifying memory enhances factual accuracy and constraint awareness, leading to more correct and personalized responses. Conversely, irrelevant memory was found to reduce topic relevance and degrade constraint awareness capabilities. These findings demonstrate how different memory types can be leveraged to improve personalization in conversational agents.

arxiv arXiv cs.CL · 19h ago

Sarashina2.2-TTS: Tackling Kanji Polyphony in Japanese Speech Generation via Data Scaling and Targeted Data Synthesis

Sarashina2.2-TTS is a Japanese-centric LLM-based text-to-speech system designed to address the linguistic challenge of context-dependent kanji polyphony. The model scales training data to approximately 361k hours, utilizing a balanced mix of Japanese and English speech corpora. To specifically handle reading disambiguation, the authors implemented a targeted data augmentation pipeline covering all 2,136 Joyo regular-use kanji. Alongside the model release, the paper introduces the Joyo Kanji Yomi Benchmark, which includes 4,378 distinct readings for these characters. The authors also propose Kana-CER, a metric that evaluates pronunciation correctness by comparing synthesized speech against reference readings in kana space. Experimental results show that this targeted augmentation significantly improves reading accuracy and achieves state-of-the-art kanji-level performance. The system matches top baselines on general sentence-level pronunciation while delivering the highest speaker similarity in zero-shot synthesis scenarios. Furthermore, cross-lingual evaluations confirm that the balanced training approach ensures stable Japanese pronunciation regardless of the prompt language used.
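
A character error rate in kana space can be sketched as an edit distance between a reference reading and the reading recovered from the synthesized audio, divided by the reference length. The version below assumes the hypothesis reading has already been obtained as a kana string (for example via speech recognition, a step not shown) and that katakana is folded into hiragana before comparison; both choices, and the function names, are assumptions rather than the paper's definition of Kana-CER.

```python
def to_hiragana(text):
    """Map katakana (U+30A1..U+30F6) to hiragana by the fixed 0x60 code-point offset."""
    return "".join(
        chr(ord(ch) - 0x60) if "\u30a1" <= ch <= "\u30f6" else ch
        for ch in text
    )

def edit_distance(a, b):
    """Levenshtein distance using a rolling one-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution or match
            ))
        prev = curr
    return prev[-1]

def kana_cer(reference, hypothesis):
    """Edit distance between kana strings, normalised by reference length."""
    ref = to_hiragana(reference)
    hyp = to_hiragana(hypothesis)
    if not ref:
        raise ValueError("reference reading must be non-empty")
    return edit_distance(ref, hyp) / len(ref)

# 今日 is read きょう in most contexts; こんにち is another valid reading of the same kanji.
print(kana_cer("きょう", "キョウ"))
print(kana_cer("きょう", "こんにち"))
```

Working in kana space means that two renderings with the same pronunciation but different scripts are not penalised, while a wrong reading of a polyphonic kanji is.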