Evaluation & benchmarks — korshunov.ai


MMGist: A Comprehensive Multimodal Benchmark for 2027

MMGist is a curated multimodal benchmark with 7,262 items, designed to address flaws in existing vision-language benchmarks. It reduces evaluation size by 69% and improves cross-model discrimination by 78%, while preserving model rankings with a Spearman correlation of 0.98. The benchmark highlights visual logic as a key weakness and emphasizes the importance of visual dependency, discriminative power, and reliability in evaluation.
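As a rough illustration of what "preserving model rankings with a Spearman correlation of 0.98" measures, the sketch below computes Spearman's rank correlation between scores on a full benchmark and on a reduced subset. The score lists are hypothetical and are not taken from MMGist, and the closed-form formula used here assumes there are no tied scores.

```python
def rank_descending(values):
    """Return ranks where 1 is the highest value (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def spearman(scores_a, scores_b):
    """Spearman rank correlation via 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(scores_a)
    ranks_a = rank_descending(scores_a)
    ranks_b = rank_descending(scores_b)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical accuracies for five models (illustrative only).
full_benchmark = [71.2, 65.0, 58.3, 49.9, 40.1]
reduced_subset = [69.0, 66.5, 51.0, 52.2, 38.7]

rho = spearman(full_benchmark, reduced_subset)
print(f"Spearman correlation: {rho:.2f}")
```

A value near 1 means the reduced subset orders the models almost exactly as the full benchmark does; in this toy data only two adjacent models swap places.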

arxiv arXiv cs.AI · 19h ago

Efficient Multimodal Models for Pulmonary Embolism Risk Assessment

A benchmark using efficient multimodal large language models evaluates PE diagnosis and risk prediction on the INSPECT dataset. Results show Gemma4 E4B and E2B outperform others when EHR data is available, with PE diagnosis achieving higher accuracy than prognostic tasks like readmission prediction.

arxiv arXiv cs.AI · 19h ago

A Differentiable Atari VCS for Explainable AI

A fully differentiable emulator of the Atari 2600 VCS is presented, reproducing all 64 ALE games with bit-for-bit accuracy in RAM and screen output. The system enables gradient-based explainable AI by providing a complex, fully known ground truth, with both Julia and JAX implementations validated against a reference emulator and capable of high-throughput differentiable rollouts on GPU.

arxiv arXiv cs.AI · 19h ago

Character Variety in LLM-Generated Stories

This study compares characters in LLM-generated and human-written stories using narratological dimensions. It finds that while LLMs produce characters with similar basic traits, they lack diversity in complex character features like stylization and wholeness. The research highlights key differences in character depth and variety between human and machine-generated narratives.

arxiv arXiv cs.AI · 20h ago

PRIME: Evaluating Prompt Resolution in Conflicting Instructions

PRIME introduces a framework to analyze how large language models handle conflicting instructions by generating calibrated conflicts in response length, format, and reasoning. The study finds that conflict type has a greater impact on model behavior than model size, revealing diverse failure modes across conflict categories. Results highlight the need for conflict awareness and suggest instruction following cannot be reliably assessed through isolated benchmarks alone.

arxiv arXiv cs.AI · 20h ago

FACTOR Enables Adaptive Verification for Factuality in Long-Form Generation

FACTOR introduces an inference-time model that adapts verification criteria based on claim-level uncertainty. It improves factuality and reduces verification cost by dynamically allocating effort to high-risk claims, demonstrating effective and model-agnostic performance on the FactScore benchmark.

arxiv arXiv cs.AI · 20h ago

LLM-Integrated App Bug Seams Reveal Testing Gaps

A rental-search assistant with LLMs and multi-market support faced persistent user defects despite 1,553 passing automated tests. Analysis of 252 bug-fix commits showed that 44% resolved issues in four unseen seams: live browser runtime, non-default market, end-to-end flows, and whole-system level. A simple practice was adopted to identify the seam with the most fixes.

arxiv arXiv cs.AI · 21h ago

Hi-Seg: Human-AI Collaboration for Pulmonary Nodule Segmentation

Hi-Seg, a human-in-the-loop framework built on SAM, achieves a mean Dice score of almost 85% in pulmonary nodule segmentation. It outperforms five state-of-the-art deep learning models and 13 SAM variants, with non-medical annotators matching junior medical student performance, reducing clinician workload and enabling scalable annotation.
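For readers unfamiliar with the metric, the Dice score is twice the overlap between the predicted and reference masks divided by the sum of their sizes. The sketch below is a minimal version for binary 2D masks; the masks are hypothetical toy data and the function name is illustrative, not part of Hi-Seg.

```python
def dice_score(predicted, reference):
    """Dice coefficient for two binary masks given as nested lists of 0/1."""
    pred_pixels = {
        (r, c)
        for r, row in enumerate(predicted)
        for c, value in enumerate(row)
        if value
    }
    ref_pixels = {
        (r, c)
        for r, row in enumerate(reference)
        for c, value in enumerate(row)
        if value
    }
    if not pred_pixels and not ref_pixels:
        return 1.0  # both masks empty: treated here as perfect agreement
    overlap = len(pred_pixels & ref_pixels)
    return 2 * overlap / (len(pred_pixels) + len(ref_pixels))


# Hypothetical 2x3 masks (illustrative only).
predicted_mask = [[0, 1, 1],
                  [0, 1, 0]]
reference_mask = [[0, 1, 0],
                  [0, 1, 1]]

print(f"Dice: {dice_score(predicted_mask, reference_mask):.3f}")
```

A reported mean Dice is the average of this per-case value over all evaluated nodules.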

media r/LocalLLaMA · 21h ago

My micro-benchmark: how good are LLMs at simulating wetting behaviour?

The author benchmarks LLMs in simulating wetting behaviour using Surface Evolver, a 1992 tool for modeling liquid surfaces. LLMs are evaluated objectively by comparing their generated datafiles against reference implementations, with results showing pass counts and token costs for each model.

lab Microsoft Research Blog · 21h ago

Talos: Automated Genomic Reanalysis for Rare Disease Diagnosis

Talos is an open-source tool that automates iterative reanalysis of genomic data to identify rare disease diagnoses. It achieved a 90% recovery rate of in-scope diagnoses with only 1.3 candidate variants per patient, and delivered 241 new diagnoses across 5,000 undiagnosed patients, with most new findings emerging within 32 days of evidence publication.

arxiv arXiv cs.AI · 22h ago

Imagine to Ensure Safety in Hierarchical Reinforcement Learning

The method combines a learnable world model with high- and low-level policies to enable safe exploration in long-horizon tasks. The high-level policy guides exploration toward safe subgoals, while the low-level policy uses imagined rollouts to prevent unsafe behaviors, outperforming existing Safe RL methods in success rate and constraint satisfaction across diverse tasks.

arxiv arXiv cs.AI · 22h ago

Governance Decay in Long-Horizon LLM Agents

Context compaction in long-horizon LLM agents silently removes in-context safety constraints, leading to prohibited tool actions. Across 1,323 episodes, compaction increases policy violations from 0% to 30% and up to 59% for some models, with violations reaching 38% when constraints are dropped. Constraint Pinning, a training-free method, restores zero violations by isolating governance constraints from compaction.

arxiv arXiv cs.AI · 22h ago

Generative Robust Optimisation Framework

Generative Robust Optimisation (GRO) introduces a deep generative model to define uncertainty sets, capturing nonlinear correlations, asymmetry, and multimodality. A five-point evaluation framework assesses neural network-based uncertainty sets across reconstruction fidelity, distribution matching, latent regularity, robust relevance, and computational tractability, with experiments validating GRO's effectiveness in production planning and facility location.

arxiv arXiv cs.AI · 22h ago

MacAgentBench Launches macOS AI Agent Benchmark

MacAgentBench introduces a comprehensive benchmark with 676 tasks across 25 applications, 60% of which involve both GUI and CLI interactions. It uses deterministic rule-based evaluation and fine-grained multi-checkpoint scoring, revealing that Claude Opus 4.6 on OpenClaw achieves 73.7% Pass@1, primarily due to its skill library rather than framework design.

media r/LocalLLaMA · 23h ago

Dual GPU Sanity Check: Is This a Smart Buy?

A user asks whether adding an RTX 5060 Ti 16GB to their existing RTX 5090 setup is worth it for the extra VRAM to run larger LLMs and extend ComfyUI video generation. The upgrade would allow using Qwen 3.6 with 256K context and improve 1440p video generation, though performance gains in ComfyUI are limited due to current software constraints.

media r/LocalLLaMA · 23h ago

Qwen-AgentWorld-35B-A3B for Coding?

The Qwen-AgentWorld-35B-A3B model shows strong performance in coding tasks, with a 65.63% score on Software Writing Evaluation and 65.92% on the overall benchmark. It outperforms Qwen3.5-35B-A3B and rivals larger models in agent-based tasks, with a first impression noting superior accuracy in long-term agent workflows.

arxiv arXiv cs.AI · 23h ago

Concept-Constrained Prompt Learning for Few-Shot CLIP Adaptation

CCPL introduces a lightweight framework that anchors class prompts to frozen concept prototypes, improving few-shot CLIP adaptation. It achieves better base-to-new performance on DTD and EuroSAT compared to CoOp, with consistent gains from text-space concept regularization, while maintaining neutrality on OxfordPets. The method uses concept dropout and controllable ensemble fusion at inference, with results sensitive to dataset semantics and protocol.

arxiv arXiv cs.AI · 23h ago

SmartSDG Pipeline Enhances Syn-to-Real Object Detection

The paper introduces SmartSDG, an automated pipeline using NVIDIA Isaac Sim and Physically-Based Shading to optimize synthetic-to-real domain adaptation. It shows that indirect lighting and complex backgrounds improve object detection by preserving surface textures and reducing false positives, outperforming conventional direct-light synthetic data.

arxiv arXiv cs.AI · 23h ago

Context-Aware Distillation and Ablation for Text2DSL

A new Text2DSL system uses context-aware distillation with a structured context of BNF grammar, API specification, and closed identifier vocabulary. Ablation studies show that the vocabulary has the largest impact on semantic quality, while API and BNF significantly improve structural validity, confirming structured context as a critical, load-bearing component.

arxiv arXiv cs.AI · 1d ago

CWE-Level Generalisation in Syscall-Based HIDS

A one-class anomaly detector trained on normal behavior of CVEs sharing a CWE class can generalise to unseen CVEs within the same class, but effectiveness varies by CWE family. The CWE-307 detector achieves F1 = 0.6976 at a 5% false positive rate, while CWE-89 and CWE-434 perform poorly, with F1 ≤ 0.21. Cross-CVE transfer is direction-dependent and driven more by the breadth of the source normal profile than the CWE category.
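One common way to report "F1 at a fixed false positive rate" is to pick the anomaly-score threshold that flags at most the target fraction of normal traces, then compute F1 at that threshold. The sketch below follows that reading; it is an assumption about the protocol, not the paper's code, and the scores and function names are hypothetical.

```python
import math


def threshold_at_fpr(normal_scores, target_fpr):
    """Pick a threshold so that at most target_fpr of normals score above it."""
    ordered = sorted(normal_scores, reverse=True)
    # Number of normal samples allowed above the threshold; the small epsilon
    # guards against floating-point error in the product.
    allowed = math.floor(target_fpr * len(ordered) + 1e-9)
    allowed = min(allowed, len(ordered) - 1)
    return ordered[allowed]


def f1_at_fpr(normal_scores, attack_scores, target_fpr=0.05):
    """F1 for the attack class when samples scoring above the threshold are flagged."""
    threshold = threshold_at_fpr(normal_scores, target_fpr)
    true_pos = sum(score > threshold for score in attack_scores)
    false_pos = sum(score > threshold for score in normal_scores)
    false_neg = len(attack_scores) - true_pos
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Hypothetical anomaly scores (higher = more anomalous), illustrative only.
normal_scores = [i / 100 for i in range(20)]  # 0.00 .. 0.19
attack_scores = [0.50, 0.30, 0.10, 0.25]

print(f"F1 at 5% FPR: {f1_at_fpr(normal_scores, attack_scores):.2f}")
```

In practice the threshold would be chosen on held-out normal data rather than on the same samples used to count false positives; the toy example reuses one list only to stay short.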