Evaluation & benchmarks — korshunov.ai

LLM-Integrated App Bug Seams Reveal Testing Gaps

A rental-search assistant with LLM features and multi-market support faced persistent user defects despite 1,553 passing automated tests. Analysis of 252 bug-fix commits showed 44% of fixes occurred at four unseen seams: browser runtime, non-default market, end-to-end flows, and whole-system level. A fix without a seam guard caused a defect to ship twice, highlighting the need for targeted testing at these boundaries.

arxiv arXiv cs.LG · 1d ago

The Scissors Effect: Resize Diversity Hurts Robust Surrogate Transfer

Input diversity, a common practice in transfer attacks, improves success on standard surrogates but reduces it on robust ones. This regime-dependent effect, called the Scissors Effect, is driven by gradient geometry, with resize operations degrading alignment in robust models. A training-free rule (CG-DI) adjusts diversity based on local gradient consistency to preserve attack success across surrogate types.

arxiv arXiv cs.LG · 1d ago

HERTA: Automated Testing for FHE Framework Vulnerabilities

HERTA is the first automated testing tool designed for fully homomorphic encryption frameworks. It uses metamorphic testing with novel relations derived from FHE semantics to detect deep-seated logic bugs that can silently corrupt encrypted computations. Evaluation on three industry frameworks revealed 21 previously unknown bugs, several of which have been confirmed and fixed by developers, with significant implications for security and service integrity.
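To illustrate the general idea of metamorphic testing mentioned above, here is a minimal sketch in Python. It is not HERTA's method or any real FHE framework's API: `ToyScheme`, `BuggyScheme`, and `distributivity_holds` are hypothetical names, and the "encryption" is an insecure toy that only mimics homomorphic add and multiply. The check needs no expected outputs. It only requires that two computations that should agree (here, distributivity) decrypt to the same value.

```python
import random


class ToyScheme:
    """Insecure toy scheme (assumption for illustration): c = m + p * r, decrypt with c mod p."""

    def __init__(self, p=1_000_003, seed=0):
        self.p = p
        self.rng = random.Random(seed)

    def enc(self, m):
        # Hide the message behind a random multiple of the secret modulus.
        return m + self.p * self.rng.randrange(1, 1000)

    def dec(self, c):
        return c % self.p

    def add(self, c1, c2):
        return c1 + c2

    def mul(self, c1, c2):
        return c1 * c2


class BuggyScheme(ToyScheme):
    """Same toy scheme with an injected logic bug in homomorphic addition."""

    def add(self, c1, c2):
        return c1 + c2 + 1  # off-by-one that silently corrupts results


def distributivity_holds(scheme, trials=50, seed=1):
    """Metamorphic relation: a * (b + c) must decrypt to the same value as a*b + a*c."""
    rng = random.Random(seed)
    for _ in range(trials):
        # Small plaintexts keep every result below p, so decryption is exact.
        a, b, c = (rng.randrange(2, 100) for _ in range(3))
        ca, cb, cc = scheme.enc(a), scheme.enc(b), scheme.enc(c)
        lhs = scheme.dec(scheme.mul(ca, scheme.add(cb, cc)))
        rhs = scheme.dec(scheme.add(scheme.mul(ca, cb), scheme.mul(ca, cc)))
        if lhs != rhs:
            return False
    return True


if __name__ == "__main__":
    print("correct scheme passes:", distributivity_holds(ToyScheme()))
    print("buggy scheme passes:", distributivity_holds(BuggyScheme()))
```

In this toy, a commutativity check would not catch the injected bug, because the off-by-one is symmetric in its arguments. Which relation is tested determines which bugs are visible.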

arxiv arXiv cs.LG · 1d ago

Concept-Constrained Prompt Learning for Few-Shot CLIP Adaptation

CCPL introduces a lightweight framework that anchors class prompts to frozen concept prototypes, improving few-shot CLIP adaptation. It achieves better base-to-new performance on DTD and EuroSAT compared to CoOp, with consistent gains from text-space concept regularization, though results vary by dataset and protocol.

arxiv arXiv cs.LG · 1d ago

Small Language Models Outperform Frontier LLMs in Relation Extraction

A fine-tuned 0.5B-parameter Qwen2.5 model achieves 0.83 micro-F1 in general-domain relation extraction, surpassing zero-shot GPT-5.4 and Claude Sonnet 4.6. On literary benchmarks, it reaches 0.92 on the Biographical dataset, again outperforming GPT-5.4. The results suggest that task-adapted small models can deliver high performance with minimal hardware and privacy overhead.

arxiv arXiv cs.AI · 1d ago

BabelJudge: Measuring LLM-as-a-Judge Reliability Across Languages and Agent Trajectories

BabelJudge introduces an open-source framework to measure four key bias modes in LLM judges across languages and agent trajectories. It reveals a significant reliability drop from Hindi to Swahili (0.714 to 0.550), highlighting cross-lingual degradation invisible to raw accuracy. The framework enables bias-aware evaluation without human labels, using controlled perturbations to create known gold labels, and extends to agentic workflows with new metrics on tool accuracy and hallucination detection.

arxiv arXiv cs.AI · 1d ago

RoboMME-Interference: Benchmarking Robot Memory Under Interference

RoboMME-Interference introduces a cross-session benchmark to evaluate robot memory under interference. It adds unrelated sessions to prior demonstrations, revealing that perceptual memory variants degrade significantly as distractions increase, highlighting current systems' lack of robustness to interference and the need for long-context memory.

arxiv arXiv cs.AI · 1d ago

Flow Annealing Posterior Sampling for Function-Space Regression and Inverse Problems

FAPS is the first function-space posterior sampling framework that unifies stochastic-process regression and PDE inverse problems. It uses pretrained flow-matching priors and Langevin correction with low-rank covariance preconditioning to enable efficient, accurate posterior inference from sparse, noisy data with coherent uncertainty quantification.

media r/LocalLLaMA · 1d ago

Has anyone else found vLLM outputs worse than llama.cpp?

A user reports less reliable outputs from vLLM than from llama.cpp, including formatting errors, context forgetting, and lower code quality. They ask whether the differences stem from quantization, chat templates, parser issues, or configuration errors, and whether others have seen similar quality gaps between inference backends.

arxiv arXiv cs.AI · 1d ago

SAFER: Reliable Test-Time Adaptation under Adversarial Streams

SAFER is a training-free framework that enhances robustness of test-time adaptation by using reliability-guided augmentation. It generates stochastic augmentations, pools predictions via correlation-weighted aggregation with outlier detection, and includes adaptive mixing to preserve clean performance under adversarial attacks. Evaluations on PACS, VLCS, and OfficeHome show improved resilience without sacrificing clean accuracy.

arxiv arXiv cs.AI · 1d ago

Reference-Free Assessment of Physical Consistency in Video Generation

A new method evaluates physical consistency in generated videos without requiring human voting or ground-truth references. It uses DROID-SLAM and SEA-RAFT to detect inconsistencies, improving task success rates by over 8% and enabling spatio-temporal localization of physical artifacts.

arxiv arXiv cs.AI · 1d ago

LLM-Assisted Label Cleaning in Chest CT Dataset

A large language model (LLM) assisted in identifying label-report discordance in the CT-RATE chest CT dataset. GPT-5.4 achieved 96.4% agreement with existing labels, with radiologist adjudication supporting LLM-derived labels in 74.2% of general and 91.9% of lymphadenopathy discordances. Multi-LLM majority-vote labels outperformed others in F1 score and kappa, and the cleaned dataset will be publicly released.

arxiv arXiv cs.AI · 1d ago

PlanBench-XL: Benchmark for Long-Horizon Tool-Use Planning

PlanBench-XL evaluates long-horizon planning in LLM agents across 1,665 tools through 327 retail tasks. It introduces a blocking mechanism to simulate real-world tool failures, revealing that agents like GPT-5.4 drop from 51.90% to 11.36% accuracy under severe disruptions, highlighting vulnerabilities in recovery and error handling.

arxiv arXiv cs.AI · 1d ago

Gold Points Sniper: Self-guided Visual Reasoning for Fine-grained Action Understanding

Gold Points Sniper (GPS) enables lightweight vision-language models to perform self-guided multimodal reasoning for fine-grained human action understanding. By integrating a Gold Points Extractor, Selective Socratic Questioner, and Semantic Entailment Evaluator, GPS achieves performance comparable to GPT-4o while maintaining superior factual accuracy on CAP benchmark-based instruction-tuning data.

arxiv arXiv cs.AI · 1d ago

Structural Codebase Index Improves Resolve Without Cost Penalty

A structural codebase index in coding agents enhances localization and resolve performance without increasing cost per cell. It outperforms agentic-grep baselines in both metrics and achieves lower cost per solved task, especially in workloads with multi-file changes.

lab Hugging Face Blog · 1d ago

Introducing the FFASR Leaderboard: Benchmarking ASR in the Real World

The FFASR Leaderboard was launched to benchmark automatic speech recognition (ASR) models under real-world conditions, across diverse environments and use cases.

arxiv arXiv cs.AI · 1d ago

MMGist: A Comprehensive Multimodal Benchmark for 2027

MMGist is a curated multimodal benchmark with 7,262 items, designed to address flaws in existing vision-language benchmarks. It reduces evaluation size by 69% and improves cross-model discrimination by 78%, while preserving model rankings with a Spearman correlation of 0.98. The benchmark highlights visual logic as a key weakness and emphasizes the importance of visual dependency, discriminative power, and reliability in evaluation.
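As a reminder of what a rank-preservation figure like the Spearman correlation above measures, the following Python sketch computes Spearman's rho between model scores on a full benchmark and on a reduced one. The scores are invented for illustration and are not from MMGist. The closed-form formula used here assumes there are no tied scores.

```python
def ranks_descending(scores):
    """Return 1-based ranks, where the highest score gets rank 1 (assumes no ties)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation via 1 - 6 * sum(d^2) / (n * (n^2 - 1)); valid without ties."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = ranks_descending(x), ranks_descending(y)
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d_squared / (n * (n * n - 1))


if __name__ == "__main__":
    # Hypothetical scores for four models (made-up numbers).
    full_benchmark = [71.2, 65.0, 58.3, 49.9]
    reduced_benchmark = [69.0, 66.1, 50.2, 52.0]  # last two models swap places
    print(spearman_rho(full_benchmark, reduced_benchmark))
```

A value of 1.0 means the reduced benchmark orders the models exactly as the full one does. Each swap of adjacent models lowers the value, and the effect of a single swap shrinks as the number of models grows.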

arxiv arXiv cs.AI · 1d ago

Efficient Multimodal Models for Pulmonary Embolism Risk Assessment

A benchmark using efficient multimodal large language models evaluates PE diagnosis and risk prediction on the INSPECT dataset. Results show Gemma4 E4B and E2B outperform others when EHR data is available, with PE diagnosis achieving higher accuracy than prognostic tasks like readmission prediction.

arxiv arXiv cs.AI · 1d ago

A Differentiable Atari VCS for Explainable AI

A fully differentiable emulator of the Atari 2600 VCS is presented, reproducing all 64 ALE games with bit-for-bit accuracy in RAM and screen output. The system enables gradient-based explainable AI by providing a complex, fully known ground truth, with both Julia and JAX implementations validated against a reference emulator and capable of high-throughput differentiable rollouts on GPU.

arxiv arXiv cs.AI · 1d ago

Character Variety in LLM-Generated Stories

This study compares characters in LLM-generated and human-written stories along narratological dimensions. It finds that while LLMs produce characters with similar basic traits, they lack diversity in complex character features like stylization and wholeness, pointing to a gap in character depth and variety between human-written and machine-generated narratives.