Evaluation & benchmarks — korshunov.ai

MacAgentBench Launches macOS AI Agent Benchmark

MacAgentBench is a macOS agent benchmark with 676 tasks across 25 applications, 60% of which involve both GUI and CLI interactions. It uses deterministic rule-based evaluation and fine-grained multi-checkpoint scoring. Claude Opus 4.6 on OpenClaw achieves 73.7% Pass@1, a result attributed primarily to its skill library rather than to framework design.
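As a rough illustration of how rule-based multi-checkpoint scoring and Pass@1 can fit together, here is a minimal sketch. The `Task` schema, the state dictionary, and the rule that a task counts as passed only when every checkpoint holds are assumptions made for this example, not details taken from MacAgentBench.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical types: the benchmark's real task schema is not described in the summary.
State = Dict[str, object]
Check = Callable[[State], bool]


@dataclass
class Task:
    name: str
    checkpoints: List[Check]  # deterministic rules evaluated on the final system state


def checkpoint_score(task: Task, state: State) -> float:
    """Partial credit: fraction of the task's checkpoints that the state satisfies."""
    passed = sum(1 for check in task.checkpoints if check(state))
    return passed / len(task.checkpoints)


def pass_at_1(tasks: List[Task], first_attempt_states: List[State]) -> float:
    """Share of tasks whose first attempt satisfies every checkpoint (assumed pass rule)."""
    solved = sum(
        1
        for task, state in zip(tasks, first_attempt_states)
        if checkpoint_score(task, state) == 1.0
    )
    return solved / len(tasks)
```

Under these assumptions, an attempt that satisfies one of two checkpoints earns a checkpoint score of 0.5 but does not count toward Pass@1.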

media r/LocalLLaMA · 19h ago

Dual GPU Sanity Check: Is This a Smart Buy?

A user asks whether adding an RTX 5060 Ti 16GB to their existing RTX 5090 setup is worth it for the extra VRAM, both to run larger LLMs and to extend ComfyUI video generation. The upgrade would allow running Qwen 3.6 with 256K context and would improve 1440p video generation, though performance gains in ComfyUI are limited by current software constraints.

media r/LocalLLaMA · 19h ago

Qwen-AgentWorld-35B-A3B for Coding?

The Qwen-AgentWorld-35B-A3B model shows strong performance in coding tasks, scoring 65.63% on Software Writing Evaluation and 65.92% on the overall benchmark. It outperforms Qwen3.5-35B-A3B and rivals larger models in agent-based tasks, and a first-impression report notes superior accuracy in long-running agent workflows.

arxiv arXiv cs.AI · 20h ago

Concept-Constrained Prompt Learning for Few-Shot CLIP Adaptation

CCPL introduces a lightweight framework that anchors class prompts to frozen concept prototypes, improving few-shot CLIP adaptation. It achieves better base-to-new performance on DTD and EuroSAT compared to CoOp, with consistent gains from text-space concept regularization, while maintaining neutrality on OxfordPets. The method uses concept dropout and controllable ensemble fusion at inference, with results sensitive to dataset semantics and protocol.

arxiv arXiv cs.AI · 20h ago

SmartSDG Pipeline Enhances Syn-to-Real Object Detection

The paper introduces SmartSDG, an automated pipeline using NVIDIA Isaac Sim and Physically-Based Shading to optimize synthetic-to-real domain adaptation. It shows that indirect lighting and complex backgrounds improve object detection by preserving surface textures and reducing false positives, outperforming conventional direct-light synthetic data.

arxiv arXiv cs.AI · 20h ago

Context-Aware Distillation and Ablation for Text2DSL

A new Text2DSL system uses context-aware distillation with a structured context of BNF grammar, API specification, and closed identifier vocabulary. Ablation studies show that the vocabulary has the largest impact on semantic quality, while API and BNF significantly improve structural validity, confirming structured context as a critical, load-bearing component.

arxiv arXiv cs.AI · 20h ago

CWE-Level Generalisation in Syscall-Based HIDS

A one-class anomaly detector trained on normal behavior of CVEs sharing a CWE class can generalise to unseen CVEs within the same class, but effectiveness varies by CWE family. The CWE-307 detector achieves F1 = 0.6976 at 5% false positive rate, while CWE-89 and CWE-434 perform poorly, with F1 ≤ 0.21. Cross-CVE transfer is direction-dependent and driven more by the breadth of the source normal profile than the CWE category.
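Reporting F1 at a fixed false positive rate usually means calibrating the alert threshold on normal data first. The sketch below shows one common way to do that; the function names and the convention that higher scores mean more anomalous are assumptions for illustration, not the paper's procedure.

```python
import math
from typing import Sequence


def threshold_at_fpr(normal_scores: Sequence[float], target_fpr: float) -> float:
    """Pick a threshold so that at most target_fpr of normal scores lie strictly above it."""
    ordered = sorted(normal_scores)
    n = len(ordered)
    # Number of normal samples allowed above the threshold (small epsilon guards float error).
    allowed = min(math.floor(target_fpr * n + 1e-9), n - 1)
    return ordered[n - allowed - 1]


def f1_at_threshold(
    normal_scores: Sequence[float],
    attack_scores: Sequence[float],
    threshold: float,
) -> float:
    """F1 when every score strictly above the threshold is flagged as anomalous."""
    tp = sum(s > threshold for s in attack_scores)
    fp = sum(s > threshold for s in normal_scores)
    fn = len(attack_scores) - tp
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0
```

Fixing the false positive rate before computing F1 keeps detectors comparable across CWE families, since each one is allowed the same share of false alarms on normal traffic.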

arxiv arXiv cs.AI · 20h ago

Importance-Weighted On-Policy Distillation Addresses Position Bias

On-Policy Distillation (OPD) suffers from position bias, in which later tokens provide poor supervision. Importance-Weighted OPD (IW-OPD) assigns dynamic weights based on distribution discrepancy, prioritizing early tokens and suppressing late ones. IW-OPD converges faster and achieves gains of up to 6.9 points on AIME-2025 compared to standard OPD.

arxiv arXiv cs.LG · 20h ago

Reward-free Pretraining for Reinforcement Learning via Occupancy Coverage Maximization

ROVER enables reward-free pretraining by maximizing occupancy coverage in state space, using a learned world model to estimate occupancy without density or entropy estimation. It introduces a virtual sink state to balance exploration of known and unknown regions, achieving more uniform coverage and better downstream performance in tabular and pixel-based navigation tasks.

arxiv arXiv cs.LG · 21h ago

TeaNet Improves Few-Shot Learning in Vibrational Spectroscopy

TeaNet, a task-enhanced augmentation network, reconstructs randomly masked spectra to generate augmented samples that preserve original spectral features while introducing domain-specific variations. This approach enables deep neural networks to identify discriminant wavenumbers more effectively, outperforming CNNs by 17% in challenging synthetic scenarios and offering improved interpretability in few-shot learning tasks.

arxiv arXiv cs.LG · 21h ago

TASER: Task-Differentiated Skill Expansion for Heterogeneous Continual Learning

TASER introduces a framework that dynamically expands and routes atomic skills for continual learning across highly heterogeneous tasks. It reduces catastrophic forgetting and improves plasticity by ensuring semantic distinctness and efficient capacity allocation through skill detection and routing mechanisms. Evaluated on HeteroCLBench, a benchmark with 19 diverse tasks across 9 cognitive dimensions, TASER outperforms existing baselines.

media r/LocalLLaMA · 21h ago

Qwen3.6 27B more dumb in vLLM compared to llama.cpp

A user reports that Qwen3.6-27B behaves noticeably worse in vLLM than in llama.cpp: it ignores messages, hallucinates tool calls, and fails to recognize prior conversation context. Despite proper configuration and prompt templates, the model appears to lose coherence and misinterpret its own tool usage, with errors occurring consistently rather than sporadically.

arxiv arXiv cs.LG · 21h ago

MedTS-TTT: Test-Time Training for Medical Time Series

MedTS-TTT introduces a test-time training framework for medical time series classification. Built on CLSA-TTT and a Gated Convolutional Backbone, it enables rapid, single-step adaptation without iterative optimization. On four public datasets, it achieves 11 top-1 rankings out of 12 evaluations across nine baselines and three metrics.

media r/LocalLLaMA · 21h ago

KaLM-Reranker-V1: Fast and Efficient Document Reranking

KaLM-Reranker-V1 is a fast reranker that is not a late-interaction model: it decouples query and passage computation while maintaining strong relevance modeling through cross-attention. It achieves state-of-the-art performance on BEIR, outperforms industrial models like Qwen3-Reranker, and shows excellent results on MIRACL and LMEB, with the 0.27B Nano model remaining competitive against 7-12B models.

arxiv arXiv cs.LG · 21h ago

Unsupervised anomaly detection with reservoir computers

A Kolmogorov-Smirnov test on reservoir computer output weights detects regime changes in nonlinear systems. The method distinguishes visually identical attractors, resolves parameter drifts seven times smaller than deep-learning baselines, and identifies ventricular flutter in ECG recordings.
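The core statistical step can be sketched with a two-sample Kolmogorov-Smirnov statistic. This example assumes the readout is refit on successive windows and that the two resulting sets of output weights are compared as plain samples; the function names and the large-sample critical value at roughly the 5% level are illustrative choices, not the paper's setup.

```python
from bisect import bisect_right
from typing import Sequence


def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    sa, sb = sorted(a), sorted(b)
    gap = 0.0
    # Both ECDFs are right-continuous step functions, so the maximum gap
    # is attained at one of the observed sample points.
    for x in sa + sb:
        fa = bisect_right(sa, x) / len(sa)
        fb = bisect_right(sb, x) / len(sb)
        gap = max(gap, abs(fa - fb))
    return gap


def regime_changed(
    ref_weights: Sequence[float],
    new_weights: Sequence[float],
    coeff: float = 1.36,  # c(alpha) for alpha of about 0.05 in the large-sample approximation
) -> bool:
    """Flag a regime change when the KS statistic exceeds the approximate critical value."""
    n, m = len(ref_weights), len(new_weights)
    critical = coeff * ((n + m) / (n * m)) ** 0.5
    return ks_statistic(ref_weights, new_weights) > critical
```

Because the test compares whole weight distributions rather than reconstructed trajectories, two regimes can be flagged as different even when their attractors look alike to the eye.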

arxiv arXiv cs.LG · 21h ago

Sea-Scan: ML-based Dark Vessel Detection with Weak Supervision

Sea-Scan uses machine learning to detect and localize dark vessels from unlabeled data. It achieves a 97.8% detection rate with only a 1.98% false-trigger rate, using weak supervision from imperfect AIS labels.

arxiv arXiv cs.LG · 22h ago

DataClaw0: Agentic Tailoring of Multimodal Data from Raw Streams

DataClaw0 introduces an agentic paradigm for actively refining raw multimodal data to align with user and downstream intents. It uses a two-stage pipeline grounded in factual anchors to generate a large-scale dataset across five domains and combines supervised fine-tuning with GRPO to achieve strong alignment with complex refinement tasks. Evaluated on video generation, VQA, and GUI navigation, DataClaw0 produces high-information-density tailored data, enabling efficient model adaptation with minimal training data.

arxiv arXiv cs.LG · 22h ago

Transformer Models Highly Sensitive to Noisy Data in Trajectory Prediction

A study finds that Transformer-based trajectory prediction models degrade significantly when object state data is noisy. Accuracy degrades by a factor of 1.3 under mild noise and by up to 3.9 under realistic high-noise conditions, highlighting the models' sensitivity and the need for noisier, real-world training data and mitigation strategies.

arxiv arXiv cs.LG · 22h ago

Open-Data Framework Identifies Urban Power Grid Topology

A new framework uses public infrastructure and OpenStreetMap data to reconstruct urban power grid topology from transmission to building-level connections. It successfully maps the grid for 7,330 buildings in Oslo's Alna district, enabling detailed power system analysis such as flow optimization and resilience studies.

arxiv arXiv cs.LG · 22h ago

SOHET: Transformer for Heterogeneous Event Streams

SOHET introduces a hierarchical transformer architecture with event-type-specific tabular encoders and self-supervised pre-training. It outperforms existing methods by 5.8% on Booking.com's fraud detection task and achieves state-of-the-art results on 6 out of 8 EBES benchmark tasks.