Evaluation & benchmarks — korshunov.ai — ML news

media r/LocalLLaMA · 23h ago

Qwen3.6 27B more dumb in vLLM compared to llama.cpp

A user reports that Qwen3.6-27B produces markedly worse output when served with vLLM than with llama.cpp: it ignores messages, hallucinates tool calls, and fails to recognize earlier conversation context. According to the user, the configuration and prompt template are correct, yet the model loses coherence and misinterprets its own tool usage, and the errors occur consistently rather than sporadically.

arxiv arXiv cs.LG · 23h ago

MedTS-TTT: Test-Time Training for Medical Time Series

MedTS-TTT introduces a test-time training framework for medical time series classification. Built on CLSA-TTT and a Gated Convolutional Backbone, it enables rapid, single-step adaptation without iterative optimization. Across four public datasets and three metrics (12 evaluations), it ranks first in 11 when compared against nine baselines.

media r/LocalLLaMA · 23h ago

KaLM-Reranker-V1: Fast and Efficient Document Reranking

KaLM-Reranker-V1 is a fast reranker that is not a late-interaction model: it decouples query and passage computation while still modeling relevance through cross-attention. It reports state-of-the-art performance on BEIR, outperforms industrial models such as Qwen3-Reranker, and performs strongly on MIRACL and LMEB; the 0.27B Nano model remains competitive with 7-12B models.

arxiv arXiv cs.LG · 1d ago

Unsupervised anomaly detection with reservoir computers

A Kolmogorov-Smirnov test on reservoir computer output weights detects regime changes in nonlinear systems. The method distinguishes visually identical attractors, resolves parameter drifts seven times smaller than deep-learning baselines, and identifies ventricular flutter in ECG recordings.
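
For readers unfamiliar with the test, here is a minimal sketch of the two-sample Kolmogorov-Smirnov statistic, the kind of distribution comparison the summary describes. The function name and the toy weight samples are hypothetical, and the paper's actual pipeline (how the reservoir is trained and which weights are compared) is not specified in this summary.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    largest_gap = 0.0
    # Empirical CDFs are step functions, so the maximum gap
    # is always attained at one of the observed values.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


# Hypothetical output-weight samples from two time windows (illustrative only).
weights_before = [0.10, 0.12, 0.11, 0.09, 0.13]
weights_after = [0.30, 0.28, 0.31, 0.29, 0.32]
print(ks_statistic(weights_before, weights_after))
```

A statistic near 0 means the two windows have similar weight distributions; a statistic near 1 means they barely overlap, which would be read as a regime change once compared against a significance threshold.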

arxiv arXiv cs.LG · 1d ago

Sea-Scan: ML-based Dark Vessel Detection with Weak Supervision

Sea-Scan uses machine learning to detect and localize dark vessels from unlabeled data. It achieves a 97.8% detection rate with only a 1.98% false-trigger rate, using weak supervision from imperfect AIS labels.

arxiv arXiv cs.LG · 1d ago

DataClaw0: Agentic Tailoring of Multimodal Data from Raw Streams

DataClaw0 introduces an agentic paradigm for actively refining raw multimodal data to align with user and downstream intents. It uses a two-stage pipeline grounded in factual anchors to generate a large-scale dataset across five domains and combines supervised fine-tuning with GRPO to achieve strong alignment with complex refinement tasks. Evaluated on video generation, VQA, and GUI navigation, DataClaw0 produces high-information-density tailored data, enabling efficient model adaptation with minimal training data.

arxiv arXiv cs.LG · 1d ago

Transformer Models Highly Sensitive to Noisy Data in Trajectory Prediction

A study finds that Transformer-based trajectory prediction models degrade significantly when object state data is noisy. Accuracy degrades by a factor of 1.3 under mild noise and by up to 3.9 under realistic high-noise conditions, which highlights the models' sensitivity and the need for mitigation strategies and for training data that reflects real-world noise.

arxiv arXiv cs.LG · 1d ago

Open-Data Framework Identifies Urban Power Grid Topology

A new framework uses public infrastructure and OpenStreetMap data to reconstruct urban power grid topology from transmission to building-level connections. It successfully maps the grid for 7,330 buildings in Oslo's Alna district, enabling detailed power system analysis such as flow optimization and resilience studies.

arxiv arXiv cs.LG · 1d ago

SOHET: Transformer for Heterogeneous Event Streams

SOHET introduces a hierarchical transformer architecture with event-type-specific tabular encoders and self-supervised pre-training. It outperforms existing methods by 5.8% on Booking.com's fraud detection task and achieves state-of-the-art results on 6 out of 8 EBES benchmark tasks.

arxiv arXiv cs.LG · 1d ago

Graph-of-Differences for Anatomy-Structured MedReID

Graph-of-Differences (GoD) introduces anatomy-structured difference alignment for medical image re-identification. It represents images as anatomy graphs, computes differences over matched anatomical regions, and anchors retrieval signals to homologous structures. GoD improves Rank-1 accuracy by 7.1 pp on fundus and 3.1 pp on CXR, with better generalization in zero-shot settings.
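
As a reminder of what the reported metric measures, the sketch below computes Rank-1 accuracy and a percentage-point (pp) difference. The function names and the toy identities are hypothetical and are not taken from the paper.

```python
def rank1_accuracy(query_ids, ranked_gallery_ids):
    """Fraction of queries whose top-ranked gallery item has the same identity."""
    if not query_ids or len(query_ids) != len(ranked_gallery_ids):
        raise ValueError("need one ranked gallery list per query")
    hits = sum(
        1
        for query_id, ranked in zip(query_ids, ranked_gallery_ids)
        if ranked and ranked[0] == query_id
    )
    return hits / len(query_ids)


def percentage_point_gain(baseline_accuracy, improved_accuracy):
    """Absolute difference between two accuracies, expressed in percentage points."""
    return (improved_accuracy - baseline_accuracy) * 100


# Hypothetical patient identities: three of the four queries retrieve the right patient first.
queries = ["p1", "p2", "p3", "p4"]
rankings = [["p1", "p7"], ["p2", "p1"], ["p9", "p3"], ["p4", "p2"]]
print(rank1_accuracy(queries, rankings))
```

A gain stated in pp is an absolute difference, so moving from 0.50 to 0.75 Rank-1 accuracy is a 25 pp improvement, not a 50% one.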

arxiv arXiv cs.LG · 1d ago

VLA-FAIL: Lightweight Failure Detection for Vision-Language-Action Models

VLA-FAIL introduces a lightweight failure detection framework for vision-language-action models that uses last-layer Mahalanobis distance and action chunk consistency, without requiring failure data or expensive action sampling. The framework combines these detectors to achieve reliable, early failure detection across diverse tasks, outperforming baseline methods in both accuracy and efficiency.

arxiv arXiv cs.LG · 1d ago

CAT-Translate: Compact Japanese-English Translation Models

CAT-Translate introduces a family of small, open-source models (0.8B to 7B parameters) specialized for Japanese-English bidirectional translation. Using synthetic parallel corpora and a two-stage fine-tuning approach with Multi-Objective GRPO, the models outperform multilingual models on real-world benchmarks across business, legal, medical, financial, and patent domains.

arxiv arXiv cs.LG · 1d ago

ADualVUOT: Heterogeneous Latent Space Alignment for Unsupervised Domain Adaptation

ADualVUOT introduces a dual-encoder VAE with Continuous Normalizing Flows to improve latent representation flexibility in medical image segmentation. It uses Gaussian-Gromov-Wasserstein distance for domain alignment and adversarial augmentation to boost robustness, outperforming prior optimal transport-based methods on medical imaging benchmarks.

arxiv arXiv cs.LG · 1d ago

LDT-FRL Framework for Cyber-Resilient IoMT

The LDT-FRL framework introduces a privacy-preserving defense system for IoMT devices, combining temporal attention, lightweight digital twins, and federated reinforcement learning. It achieves 99.66% and 99.95% accuracy on CICDDoS 2019 and TON-IoT benchmarks, with perfect F1 on the MITM class, converging 81% faster than prior methods and offering interpretable defense decisions via SHAP and Grad-CAM.

arxiv arXiv cs.LG · 1d ago

Fast-TurboQuant: Multiplier-Free Vector Quantization

Fast-TurboQuant introduces a multiplier-free projection method using a structured fast Johnson-Lindenstrauss transform. It replaces dense random rotation matrices with Rademacher phase inversion and fast Walsh-Hadamard transform, reducing arithmetic to only additions and improving Recall@10 with lower mean squared error.
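
The two building blocks named in the summary can be sketched as follows: random sign flips (Rademacher) followed by a fast Walsh-Hadamard transform, using only additions, subtractions, and negations. This is an illustrative sketch under assumptions; the function names, the seed handling, and the lack of normalization are choices made here, not details from the paper.

```python
import random


def fwht(values):
    """Unnormalized fast Walsh-Hadamard transform; length must be a power of two."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    half = 1
    while half < n:
        for start in range(0, n, 2 * half):
            for i in range(start, start + half):
                x, y = out[i], out[i + half]
                # Butterfly step: one addition and one subtraction, no multiplication.
                out[i], out[i + half] = x + y, x - y
        half *= 2
    return out


def rademacher_signs(n, seed=0):
    """Random +1/-1 signs; the seed parameter is a hypothetical convenience."""
    rng = random.Random(seed)
    return [rng.choice((1, -1)) for _ in range(n)]


def sign_flip_hadamard(vector, signs):
    """Flip signs by negation, then apply the transform."""
    flipped = [v if s > 0 else -v for v, s in zip(vector, signs)]
    return fwht(flipped)


print(fwht([1, 2, 3, 4]))  # [10, -2, -4, 0]
```

The unnormalized transform scales squared norms by the vector length, so distances between vectors are preserved up to that known constant. It runs in O(n log n) operations, compared with O(n²) for multiplying by a dense rotation matrix.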

arxiv arXiv cs.LG · 1d ago

Post-Training Speech Enhancement with Perceptual Rewards

A new post-training method uses multi-metric perceptual rewards to optimize speech enhancement models. It directly applies non-differentiable metrics like DNSMOS, WER, and UTMOS as rewards via Group Sequence Policy Optimization, achieving state-of-the-art results on DNS2020. Human evaluation confirms that combining multiple metrics outperforms single-metric approaches, reducing reward hacking.

arxiv arXiv cs.LG · 1d ago

Native space pipelines outperform template space in subcortical segmentation

Native-space UNet pipelines outperform template-space pipelines in subcortical segmentation, with higher Dice scores and lower HD95 values for the Subthalamic Nucleus, Red Nucleus, and Substantia Nigra. Performance drops significantly when the models are applied to 3T images, and synthetic 3T training data provides only modest gains, highlighting a persistent domain gap between 7T and 3T MRI.
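
For reference, the Dice score mentioned above measures the overlap between a predicted mask and a ground-truth mask. The sketch below works on flattened binary masks; the function name and the toy masks are hypothetical, and HD95 is not covered.

```python
def dice_score(predicted_mask, target_mask):
    """Dice coefficient for two flattened binary masks of equal length."""
    if len(predicted_mask) != len(target_mask):
        raise ValueError("masks must have the same length")
    overlap = sum(1 for p, t in zip(predicted_mask, target_mask) if p and t)
    total = sum(1 for p in predicted_mask if p) + sum(1 for t in target_mask if t)
    if total == 0:
        # Convention chosen here: two empty masks count as a perfect match.
        return 1.0
    return 2 * overlap / total


# Toy example: the prediction covers the one true voxel plus one false positive.
predicted = [1, 1, 0, 0]
target = [1, 0, 0, 0]
print(dice_score(predicted, target))
```

Dice ranges from 0 (no overlap) to 1 (identical masks). Small structures such as the nuclei listed above are sensitive to the metric because a few misclassified voxels change the ratio noticeably.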

arxiv arXiv cs.LG · 1d ago

Deep Learning Fuses Satellite Data with Meteorological Features for Soil Moisture Estimation

A study validates a Cross-Correlation Function method to identify optimal temporal and depth lags between meteorological variables and soil moisture. Using satellite and meteorological data across seven agricultural plots in southeastern Spain, deep learning models achieved significant improvements: a per-pixel CNN reached R² = 0.877, while a CNN-LSTM hybrid achieved the highest overall performance with R² = 0.930. Subsurface depth information and meteorological features substantially enhanced estimation accuracy.
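
A minimal sketch of lag selection by cross-correlation follows: for each candidate lag, correlate the driver series with the shifted response series and keep the lag with the strongest correlation. The function names and the toy series are hypothetical; the study's actual variables, lag ranges, and depth handling are not given in this summary.

```python
def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / (var_x * var_y) ** 0.5


def best_lag(driver, response, max_lag):
    """Return (lag, correlation) where driver[t] best matches response[t + lag]."""
    best = (0, 0.0)
    for lag in range(max_lag + 1):
        shifted = response[lag:]
        n = min(len(driver), len(shifted))
        if n < 3:
            break  # too few overlapping samples to correlate
        r = pearson(driver[:n], shifted[:n])
        if abs(r) > abs(best[1]):
            best = (lag, r)
    return best


# Toy data: the response repeats the driver two steps later.
rainfall = [0, 1, 0, 3, 0, 2, 0, 5, 1, 0, 4, 0]
soil_moisture = [0, 0] + rainfall[:-2]
print(best_lag(rainfall, soil_moisture, max_lag=4))
```

On the toy data the selected lag is 2, because the response series is an exact copy of the driver delayed by two samples.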

arxiv arXiv cs.LG · 1d ago

Adversarial Training Equivalence Fails for Nonlinear Models

A formal proof shows no equivalence exists between adversarial risk and regularized risk in two-layer networks. Empirical results on Wide-ResNets confirm this impossibility persists in deeper, more expressive architectures.

arxiv arXiv cs.LG · 1d ago

Machine Learning Model Predicts High-Risk Colorectal Polyps in African Americans

A machine learning model developed using pre-colonoscopy clinical features predicts high-risk colorectal polyps in African Americans. The model, validated in a diverse urban cohort, uses demographic, lifestyle, and comorbidity data to identify patients at higher risk, with external validation conducted in 2023-2024.