Evaluation & benchmarks — korshunov.ai

VLA-FAIL: Lightweight Failure Detection for Vision-Language-Action Models

VLA-FAIL introduces a lightweight failure detection framework for vision-language-action models. It uses last-layer Mahalanobis distance and action chunk consistency, and requires neither failure data nor expensive action sampling. The framework combines these two detectors to achieve reliable, early failure detection across diverse tasks, outperforming baseline methods in both accuracy and efficiency.
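
As a rough illustration of the Mahalanobis-distance half of this idea, the sketch below fits a Gaussian to feature vectors from successful runs and scores new vectors by their distance from it. It is not the paper's implementation: the random arrays stand in for last-layer features, and the names `fit_gaussian` and `mahalanobis`, the 8-dimensional features, and the ridge term are all assumptions made for the example.

```python
import numpy as np

def fit_gaussian(features):
    """Estimate mean and inverse covariance from success-only features of shape (n, d)."""
    mean = features.mean(axis=0)
    cov = np.cov(features, rowvar=False)
    # Small ridge term (an assumption) keeps the covariance safely invertible.
    cov = cov + 1e-6 * np.eye(features.shape[1])
    return mean, np.linalg.inv(cov)

def mahalanobis(x, mean, cov_inv):
    """Mahalanobis distance of a single feature vector x from the fitted Gaussian."""
    diff = x - mean
    return float(np.sqrt(diff @ cov_inv @ diff))

# Stand-in data: random vectors play the role of last-layer features.
rng = np.random.default_rng(0)
train = rng.normal(size=(500, 8))
mean, cov_inv = fit_gaussian(train)

near = mahalanobis(np.zeros(8), mean, cov_inv)      # close to the training mean
far = mahalanobis(np.full(8, 6.0), mean, cov_inv)   # far from the training data
```

A detector built this way would flag a step when the distance exceeds a threshold chosen on held-out successful runs; how VLA-FAIL sets that threshold is not stated in the summary.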

arxiv arXiv cs.LG · 22h ago

CAT-Translate: Compact Japanese-English Translation Models

CAT-Translate introduces a family of small, open-source models (0.8B to 7B parameters) specialized for Japanese-English bidirectional translation. Using synthetic parallel corpora and a two-stage fine-tuning approach with Multi-Objective GRPO, the models outperform multilingual models on real-world benchmarks across business, legal, medical, financial, and patent domains.

arxiv arXiv cs.LG · 23h ago

ADualVUOT: Heterogeneous Latent Space Alignment for Unsupervised Domain Adaptation

ADualVUOT introduces a dual-encoder VAE with Continuous Normalizing Flows to improve latent representation flexibility in medical image segmentation. It uses Gaussian-Gromov-Wasserstein distance for domain alignment and adversarial augmentation to boost robustness, outperforming prior optimal transport-based methods on medical imaging benchmarks.

arxiv arXiv cs.LG · 23h ago

LDT-FRL Framework for Cyber-Resilient IoMT

The LDT-FRL framework introduces a privacy-preserving defense system for IoMT devices, combining temporal attention, lightweight digital twins, and federated reinforcement learning. It achieves 99.66% and 99.95% accuracy on CICDDoS 2019 and TON-IoT benchmarks, with perfect F1 on the MITM class, converging 81% faster than prior methods and offering interpretable defense decisions via SHAP and Grad-CAM.

arxiv arXiv cs.LG · 23h ago

Fast-TurboQuant: Multiplier-Free Vector Quantization

Fast-TurboQuant introduces a multiplier-free projection method using a structured fast Johnson-Lindenstrauss transform. It replaces dense random rotation matrices with Rademacher phase inversion and fast Walsh-Hadamard transform, reducing arithmetic to only additions and improving Recall@10 with lower mean squared error.
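
The two building blocks named above can be sketched in a few lines. This is a generic, unnormalized fast Walsh-Hadamard transform preceded by random sign flips, not the paper's code; the function names and the choice to leave out the 1/sqrt(n) scaling (which would need a multiplication) are assumptions for the example.

```python
import numpy as np

def fwht(x):
    """Unnormalized fast Walsh-Hadamard transform; length must be a power of two."""
    x = np.array(x, dtype=float)
    n = x.shape[0]
    if n == 0 or (n & (n - 1)) != 0:
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            # Butterfly step: only additions and subtractions.
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x

def signed_hadamard(x, signs):
    """Flip signs with a Rademacher vector (+1/-1), then apply the transform."""
    x = np.asarray(x, dtype=float)
    flipped = np.where(np.asarray(signs) > 0, x, -x)  # negation, no multiply
    return fwht(flipped)

rng = np.random.default_rng(0)
vec = rng.normal(size=16)
signs = rng.choice([-1, 1], size=16)
projected = signed_hadamard(vec, signs)
```

The transform takes n log2(n) additions instead of the n^2 multiply-adds of a dense rotation, and it scales the squared norm by exactly n, so distances are preserved up to a known constant.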

arxiv arXiv cs.LG · 23h ago

Post-Training Speech Enhancement with Perceptual Rewards

A new post-training method uses multi-metric perceptual rewards to optimize speech enhancement models. It directly applies non-differentiable metrics like DNSMOS, WER, and UTMOS as rewards via Group Sequence Policy Optimization, achieving state-of-the-art results on DNS2020. Human evaluation confirms that combining multiple metrics outperforms single-metric approaches, reducing reward hacking.

arxiv arXiv cs.LG · 23h ago

Native space pipelines outperform template space in subcortical segmentation

Native-space UNet pipelines outperform template-space ones in subcortical segmentation, showing higher Dice scores and lower HD95 values for the Subthalamic Nucleus, Red Nucleus, and Substantia Nigra. Performance drops significantly when the models are applied to 3T images, and synthetic 3T training data provides only modest gains, highlighting a persistent domain gap between 7T and 3T MRI.
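
For readers unfamiliar with the Dice score mentioned here, a minimal sketch of the standard definition on binary masks follows. The function name and the convention of returning 1.0 for two empty masks are assumptions; the study's own evaluation code is not shown in the summary.

```python
import numpy as np

def dice_score(pred, target):
    """Dice overlap between two binary masks: 2|A and B| / (|A| + |B|)."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return float(2.0 * np.logical_and(pred, target).sum() / denom)

perfect = dice_score([1, 1, 0, 0], [1, 1, 0, 0])
partial = dice_score([1, 1, 0, 0], [1, 0, 0, 0])
```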

arxiv arXiv cs.LG · 23h ago

Deep Learning Fuses Satellite Data with Meteorological Features for Soil Moisture Estimation

A study validates a Cross-Correlation Function method to identify optimal temporal and depth lags between meteorological variables and soil moisture. Using satellite and meteorological data across seven agricultural plots in southeastern Spain, deep learning models achieved significant improvements: a per-pixel CNN reached R² = 0.877, while a CNN-LSTM hybrid achieved the highest overall performance with R² = 0.930. Subsurface depth information and meteorological features substantially enhanced estimation accuracy.
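
The lag-selection idea can be illustrated with a small cross-correlation scan. This is a sketch on synthetic data, not the study's method: the `best_lag` helper, the variable names, the 3-sample delay, and the choice to keep the most positive Pearson correlation are all assumptions.

```python
import numpy as np

def best_lag(driver, response, max_lag):
    """Return (lag, r): the lag in samples at which driver best correlates with later response."""
    best, best_r = 0, -np.inf
    for lag in range(max_lag + 1):
        x = driver[: len(driver) - lag]   # driver at time t
        y = response[lag:]                # response at time t + lag
        r = np.corrcoef(x, y)[0, 1]
        if r > best_r:
            best, best_r = lag, r
    return best, float(best_r)

# Synthetic example: the response follows the driver with a 3-sample delay.
rng = np.random.default_rng(1)
rain = rng.normal(size=400)
moisture = np.roll(rain, 3) + 0.1 * rng.normal(size=400)
lag, r = best_lag(rain, moisture, max_lag=10)
```

In a real pipeline the selected lag would be used to shift each meteorological feature before it is fed to the model, with the scan repeated per variable and per soil depth.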

arxiv arXiv cs.LG · 23h ago

Adversarial Training Equivalence Fails for Nonlinear Models

A formal proof shows no equivalence exists between adversarial risk and regularized risk in two-layer networks. Empirical results on Wide-ResNets confirm this impossibility persists in deeper, more expressive architectures.

arxiv arXiv cs.LG · 1d ago

Machine Learning Model Predicts High-Risk Colorectal Polyps in African Americans

A machine learning model developed using pre-colonoscopy clinical features predicts high-risk colorectal polyps in African Americans. The model, validated in a diverse urban cohort, uses demographic, lifestyle, and comorbidity data to identify patients at higher risk, with external validation conducted in 2023-2024.

arxiv arXiv cs.LG · 1d ago

JS Divergence Enhances GRPO Autoregressive Text-to-Image Alignment

A study introduces JS divergence in GRPO-style autoregressive text-to-image alignment, showing it effectively balances policy optimization and generation diversity. Experiments on LlamaGen and Janus-7B demonstrate JS divergence achieves top or competitive performance across metrics while preserving diverse outputs.
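
For reference, the Jensen-Shannon divergence between two discrete distributions can be computed as below. This shows only the standard definition; how the study wires it into the GRPO objective is not described in the summary, and the function names are assumptions.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) in nats, skipping terms where p is zero."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their mixture m."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

same = js_divergence([0.2, 0.8], [0.2, 0.8])
disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])
```

Unlike KL divergence, this quantity is symmetric and bounded above by log 2 (in nats), which is reached when the two distributions do not overlap.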

arxiv arXiv cs.LG · 1d ago

Privacy-Preserving Federated Temporal Graph Learning for Cyber-Resilient IoMT

The paper introduces Federated TGCN-A2C, a privacy-preserving framework that achieves 99.48% and 99.61% test accuracy on the CICDDoS 2019 and TON-IoT benchmarks, outperforming Fed-Inforce-Fusion by 0.21 percentage points. It includes anomaly detection, digital twin-based scoring, adaptive action selection, and an enhanced honeypot layer. All major attack classes achieve F1 scores above 0.92 and 0.94 on the two benchmarks, respectively, and post-hoc explainability is provided via SHAP, LIME, Grad-CAM, and counterfactual analysis.

arxiv arXiv cs.LG · 1d ago

Analytic Policy Gradients for Efficient Continuous Control

Analytic Policy Gradients (APG) enables exact gradient computation via backpropagation through simulation when environment dynamics are differentiable. APG outperforms Proximal Policy Optimization (PPO) on four continuous control tasks, showing superior sample and learning efficiency with a segmented backpropagation scheme that reduces gradient degradation on long-horizon tasks.

media Hugging Face Forums · 1d ago

Wav2vec2 and WavLM Audio Classifier Stuck at 33% Accuracy

A user reports that fine-tuning wav2vec2-base or wavlm-base-plus for 3-class audio classification reaches only 33% accuracy, which is chance level. Only the classification head is updated, the clips are padded to 1.0 s and passed without attention masks, and the learning rate is 1e-3. The post also mentions class imbalance and short input clips.

arxiv arXiv cs.CL · 1d ago

ParaPairAudioBench: Benchmark for Paralinguistic Speech Evaluation

ParaPairAudioBench introduces a pairwise benchmark of 5,175 audio pairs across five paralinguistic dimensions. It reveals that current LALM judges lag human judgments by 32% on average and fail to calibrate, especially in tie cases where abstention is correct.

arxiv arXiv cs.CL · 1d ago

AI-PAVE-Br: LLM-Based PAVE for Brazilian E-Commerce

AI-PAVE-Br uses large language models to enhance product attribute value extraction in Brazilian e-commerce. The system outperforms traditional NER methods, with a new Golden Set dataset providing a manually annotated benchmark for Portuguese product data.

arxiv arXiv cs.CL · 1d ago

DREAM: Autoregressive Training for Dense Retrieval Embeddings

DREAM uses autoregressive next-token prediction to supervise dense retrieval embedding training. It injects query-document similarity scores into a frozen LLM's attention heads, enabling gradient backpropagation for retriever optimization. DREAM outperforms baselines on BEIR and RTEB benchmarks across model scales.

arxiv arXiv cs.CL · 1d ago

CN-NewsTTS Bench v0.1 Released

CN-NewsTTS Bench v0.1 is an open benchmark for evaluating Chinese news TTS systems' ability to correctly pronounce raw text targets. It includes 200 development and 800 public test records, 992 auto-evaluable targets, and results for seven TTS systems, with the best achieving 0.879 strict accuracy and several below 0.60.

arxiv arXiv cs.CL · 1d ago

Task Decomposition for Efficient Annotation

We propose decomposing structured annotation tasks into sub-tasks to reduce overall inferential load. By identifying salient anchor entities—centers in the space of valid annotations—we constrain output complexity and improve cost-efficiency. We provide guidelines for decomposition and a procedure to allocate sub-tasks across human and model annotators for optimal quality under fixed budgets.

arxiv arXiv cs.CL · 1d ago

CANDLE: Lightweight Arabic Noise Deduplication via CTC

CANDLE is a lightweight system that uses Connectionist Temporal Classification to deduplicate repeated characters in Arabic text, without relying on handcrafted rules or morphological analyzers. It achieves a Sentence Error Rate of 5.37% and reduces tokenizer fertility by up to 12.8%, lowering inference costs and improving context window usage.