Evaluation & benchmarks — korshunov.ai


Sea-Scan: ML-based Dark Vessel Detection with Weak Supervision

Sea-Scan uses machine learning to detect and localize dark vessels from unlabeled data. It achieves a 97.8% detection rate with only a 1.98% false-trigger rate, using weak supervision from imperfect AIS labels.

arxiv arXiv cs.LG · 23h ago

DataClaw0: Agentic Tailoring of Multimodal Data from Raw Streams

DataClaw0 introduces an agentic paradigm for actively refining raw multimodal data to align with user and downstream intents. It uses a two-stage pipeline grounded in factual anchors to generate a large-scale dataset across five domains and combines supervised fine-tuning with GRPO to achieve strong alignment with complex refinement tasks. Evaluated on video generation, VQA, and GUI navigation, DataClaw0 produces high-information-density tailored data, enabling efficient model adaptation with minimal training data.

arxiv arXiv cs.LG · 23h ago

Transformer Models Highly Sensitive to Noisy Data in Trajectory Prediction

A study finds that Transformer-based trajectory prediction models degrade significantly when object state data is noisy. Performance worsens by a factor of 1.3 under mild noise and by up to 3.9 under realistic high-noise conditions, which highlights the models' sensitivity and the need for noisier, real-world training data and mitigation strategies.

arxiv arXiv cs.LG · 23h ago

Open-Data Framework Identifies Urban Power Grid Topology

A new framework uses public infrastructure and OpenStreetMap data to reconstruct urban power grid topology from transmission to building-level connections. It successfully maps the grid for 7,330 buildings in Oslo's Alna district, enabling detailed power system analysis such as flow optimization and resilience studies.

arxiv arXiv cs.LG · 23h ago

SOHET: Transformer for Heterogeneous Event Streams

SOHET introduces a hierarchical transformer architecture with event-type-specific tabular encoders and self-supervised pre-training. It outperforms existing methods by 5.8% on Booking.com's fraud detection task and achieves state-of-the-art results on 6 out of 8 EBES benchmark tasks.

arxiv arXiv cs.LG · 23h ago

Graph-of-Differences for Anatomy-Structured MedReID

Graph-of-Differences (GoD) introduces anatomy-structured difference alignment for medical image re-identification. It represents images as anatomy graphs, computes differences over matched anatomical regions, and anchors retrieval signals to homologous structures. GoD improves Rank-1 accuracy by 7.1 pp on fundus and 3.1 pp on CXR, with better generalization in zero-shot settings.

arxiv arXiv cs.LG · 23h ago
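The Rank-1 figures in the Graph-of-Differences summary above are retrieval accuracies: the share of query images whose nearest gallery match has the same identity. The sketch below shows how such a number, and a gain in percentage points, is typically computed. It is not the paper's evaluation code; the function names, the use of squared Euclidean distance, and the flat list-of-embeddings layout are all assumptions made for illustration.

```python
def rank1_accuracy(query_feats, query_ids, gallery_feats, gallery_ids):
    """Fraction of queries whose nearest gallery item shares their identity."""
    def sq_dist(a, b):
        # Squared Euclidean distance (assumed metric, chosen for simplicity).
        return sum((x - y) ** 2 for x, y in zip(a, b))

    hits = 0
    for feat, true_id in zip(query_feats, query_ids):
        # Index of the closest gallery embedding for this query.
        best = min(range(len(gallery_feats)),
                   key=lambda j: sq_dist(feat, gallery_feats[j]))
        hits += gallery_ids[best] == true_id
    return hits / len(query_feats)


def gain_pp(new_acc, old_acc):
    """Difference between two accuracies, expressed in percentage points."""
    return (new_acc - old_acc) * 100
```

A gain of 7.1 pp means the absolute difference between two accuracies, for example 0.729 to 0.800, rather than a 7.1% relative improvement.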

VLA-FAIL: Lightweight Failure Detection for Vision-Language-Action Models

VLA-FAIL introduces a lightweight failure detection framework for vision-language-action models that uses last-layer Mahalanobis distance and action chunk consistency without requiring failure data or expensive action sampling. The framework combines these detectors to achieve reliable, early failure detection across diverse tasks, outperforming baseline methods in both accuracy and efficiency.

arxiv arXiv cs.LG · 23h ago

CAT-Translate: Compact Japanese-English Translation Models

CAT-Translate introduces a family of small, open-source models (0.8B to 7B parameters) specialized for Japanese-English bidirectional translation. Using synthetic parallel corpora and a two-stage fine-tuning approach with Multi-Objective GRPO, the models outperform multilingual models on real-world benchmarks across business, legal, medical, financial, and patent domains.

arxiv arXiv cs.LG · 1d ago

ADualVUOT: Heterogeneous Latent Space Alignment for Unsupervised Domain Adaptation

ADualVUOT introduces a dual-encoder VAE with Continuous Normalizing Flows to improve latent representation flexibility in medical image segmentation. It uses Gaussian-Gromov-Wasserstein distance for domain alignment and adversarial augmentation to boost robustness, outperforming prior optimal transport-based methods on medical imaging benchmarks.

arxiv arXiv cs.LG · 1d ago

LDT-FRL Framework for Cyber-Resilient IoMT

The LDT-FRL framework is a privacy-preserving defense system for IoMT devices that combines temporal attention, lightweight digital twins, and federated reinforcement learning. It achieves 99.66% and 99.95% accuracy on the CICDDoS 2019 and TON-IoT benchmarks, respectively, with a perfect F1 score on the MITM class. It converges 81% faster than prior methods and offers interpretable defense decisions via SHAP and Grad-CAM.

arxiv arXiv cs.LG · 1d ago

Fast-TurboQuant: Multiplier-Free Vector Quantization

Fast-TurboQuant introduces a multiplier-free projection method using a structured fast Johnson-Lindenstrauss transform. It replaces dense random rotation matrices with Rademacher phase inversion and fast Walsh-Hadamard transform, reducing arithmetic to only additions and improving Recall@10 with lower mean squared error.

arxiv arXiv cs.LG · 1d ago
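The two ingredients named in the Fast-TurboQuant summary above, Rademacher sign flips and a fast Walsh-Hadamard transform, can be illustrated with a minimal sketch that uses only additions, subtractions, and negations. This is a generic construction, not the paper's implementation: the function names are hypothetical, the input length is assumed to be a power of two, and the usual 1/sqrt(n) normalization is left out.

```python
import random


def fwht(values):
    """Unnormalized fast Walsh-Hadamard transform (length must be a power of two)."""
    out = list(values)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                # Butterfly step: one addition and one subtraction, no multiply.
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out


def rademacher_signs(n, seed=0):
    """Random +1/-1 signs drawn from a seeded generator."""
    rng = random.Random(seed)
    return [rng.choice((-1, 1)) for _ in range(n)]


def sign_hadamard_project(x, signs):
    """Flip signs (negation only), then apply the Walsh-Hadamard transform."""
    flipped = [-v if s < 0 else v for v, s in zip(x, signs)]
    return fwht(flipped)
```

Without normalization the transform scales squared norms by exactly n, so distances between vectors are preserved up to that known constant. The transform costs O(n log n) additions, compared with O(n^2) multiply-adds for a dense rotation matrix.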

Post-Training Speech Enhancement with Perceptual Rewards

A new post-training method uses multi-metric perceptual rewards to optimize speech enhancement models. It directly applies non-differentiable metrics like DNSMOS, WER, and UTMOS as rewards via Group Sequence Policy Optimization, achieving state-of-the-art results on DNS2020. Human evaluation confirms that combining multiple metrics outperforms single-metric approaches, reducing reward hacking.

arxiv arXiv cs.LG · 1d ago

Native space pipelines outperform template space in subcortical segmentation

Native space-based UNet pipelines outperform template space ones in subcortical segmentation, showing higher Dice scores and lower HD95 values for the Subthalamic Nucleus, Red Nucleus, and Substantia Nigra. Performance drops significantly when applied to 3T images, with synthetic 3T training data providing only modest gains, highlighting a persistent domain gap between 7T and 3T MRI.

arxiv arXiv cs.LG · 1d ago
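The Dice scores in the segmentation summary above measure overlap between a predicted mask and a reference mask. A minimal sketch of the metric follows, assuming binary masks flattened into sequences of 0/1. The function name and the handling of two empty masks are assumptions, and HD95 is not covered.

```python
def dice_score(pred, target):
    """Dice coefficient for two binary masks given as flat sequences of 0/1."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        # Assumed convention: two empty masks count as perfect agreement.
        return 1.0
    return 2 * intersection / total
```

For small structures such as the Subthalamic Nucleus, a few misplaced voxels change the Dice score noticeably, which is one reason a boundary-distance metric such as HD95 is commonly reported next to it.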

Deep Learning Fuses Satellite Data with Meteorological Features for Soil Moisture Estimation

A study validates a Cross-Correlation Function method to identify optimal temporal and depth lags between meteorological variables and soil moisture. Using satellite and meteorological data across seven agricultural plots in southeastern Spain, deep learning models achieved significant improvements: a per-pixel CNN reached R² = 0.877, while a CNN-LSTM hybrid achieved the highest overall performance with R² = 0.930. Subsurface depth information and meteorological features substantially enhanced estimation accuracy.

arxiv arXiv cs.LG · 1d ago
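The lag-selection idea in the soil moisture summary above can be sketched as a search over time shifts for the one that maximizes the absolute correlation between a meteorological series and a soil moisture series. This is a simplified illustration, not the study's method: it covers temporal lags only, assumes two equally long and evenly sampled series, and uses hypothetical function names.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equally long sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def best_lag(driver, response, max_lag):
    """Return (lag, correlation) maximizing |corr(driver[t], response[t + lag])|."""
    scores = {}
    for lag in range(max_lag + 1):
        # Pair each driver value with the response observed `lag` steps later.
        x = driver[: len(driver) - lag]
        y = response[lag:]
        scores[lag] = pearson(x, y)
    lag = max(scores, key=lambda k: abs(scores[k]))
    return lag, scores[lag]
```

The selected lag can then be used to shift the meteorological features before they are passed to a model, so that each soil moisture value is paired with the earlier weather it responds to.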

Adversarial Training Equivalence Fails for Nonlinear Models

A formal proof shows no equivalence exists between adversarial risk and regularized risk in two-layer networks. Empirical results on Wide-ResNets confirm this impossibility persists in deeper, more expressive architectures.

arxiv arXiv cs.LG · 1d ago

Machine Learning Model Predicts High-Risk Colorectal Polyps in African Americans

A machine learning model developed using pre-colonoscopy clinical features predicts high-risk colorectal polyps in African Americans. The model, validated in a diverse urban cohort, uses demographic, lifestyle, and comorbidity data to identify patients at higher risk, with external validation conducted in 2023-2024.

arxiv arXiv cs.LG · 1d ago

JS Divergence Enhances GRPO Autoregressive Text-to-Image Alignment

A study introduces JS divergence in GRPO-style autoregressive text-to-image alignment, showing it effectively balances policy optimization and generation diversity. Experiments on LlamaGen and Janus-7B demonstrate JS divergence achieves top or competitive performance across metrics while preserving diverse outputs.

arxiv arXiv cs.LG · 1d ago
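For readers unfamiliar with the Jensen-Shannon divergence mentioned in the summary above, the sketch below computes it for two discrete distributions. How the study applies it inside the GRPO objective is not described here, so this shows only the divergence itself. The function names are hypothetical, and the inputs are assumed to be probability vectors of equal length.

```python
from math import log


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their midpoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

Unlike KL divergence, this quantity is symmetric in its arguments and bounded above by ln 2, so it stays finite even when the two distributions put mass on different outcomes.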

Privacy-Preserving Federated Temporal Graph Learning for Cyber-Resilient IoMT

The paper introduces Federated TGCN-A2C, a privacy-preserving framework that achieves 99.48% and 99.61% test accuracy on the CICDDoS 2019 and TON-IoT benchmarks, outperforming Fed-Inforce-Fusion by 0.21 percentage points. It combines anomaly detection, digital twin-based scoring, adaptive action selection, and an enhanced honeypot layer. All major attack classes achieve F1 scores above 0.92 and 0.94 on the two benchmarks, respectively, and post-hoc explainability is provided via SHAP, LIME, Grad-CAM, and counterfactual analysis.

arxiv arXiv cs.LG · 1d ago

Analytic Policy Gradients for Efficient Continuous Control

Analytic Policy Gradients (APG) enables exact gradient computation via backpropagation through simulation when environment dynamics are differentiable. APG outperforms Proximal Policy Optimization (PPO) on four continuous control tasks, showing superior sample and learning efficiency with a segmented backpropagation scheme that reduces gradient degradation on long-horizon tasks.

media Hugging Face Forums · 1d ago

Wav2vec2 and WavLM Audio Classifier Stuck at 33% Accuracy

A user reports that fine-tuning wav2vec2-base or wavlm-base-plus for 3-class audio classification reaches only 33% accuracy, which matches chance level. Only the classification head is updated, clips are padded to 1.0 s without attention masks, and the learning rate is 1e-3. The setup also involves class imbalance and short input clips.
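An accuracy equal to chance on a 3-class task often means the model predicts the same class for every input. The sketch below is one hypothetical way to check this from a list of predictions and labels. It is not taken from the forum post, and the function name and returned fields are assumptions.

```python
from collections import Counter


def collapse_report(predictions, labels):
    """Compare accuracy with the majority-class baseline and count predicted classes."""
    n = len(labels)
    accuracy = sum(p == y for p, y in zip(predictions, labels)) / n
    # Accuracy obtained by always predicting the most frequent label.
    majority_baseline = max(Counter(labels).values()) / n
    pred_counts = Counter(predictions)
    return {
        "accuracy": accuracy,
        "majority_baseline": majority_baseline,
        "prediction_counts": dict(pred_counts),
        # True when every prediction is the same class.
        "collapsed": len(pred_counts) == 1,
    }
```

If `collapsed` is true, the head is likely not receiving a useful signal from the frozen encoder. In that case the padding and masking setup is worth checking before changing the learning rate.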