Evaluation & benchmarks
arXiv cs.LG · 23h ago

DataClaw0: Agentic Tailoring of Multimodal Data from Raw Streams

DataClaw0 introduces an agentic paradigm for actively refining raw multimodal data to align with user and downstream intents. It uses a two-stage pipeline grounded in factual anchors to generate a large-scale dataset across five domains and combines supervised fine-tuning with GRPO to achieve strong alignment with complex refinement tasks. Evaluated on video generation, VQA, and GUI navigation, DataClaw0 produces high-information-density tailored data, enabling efficient model adaptation with minimal training data.

arXiv cs.LG · 23h ago

VLA-FAIL: Lightweight Failure Detection for Vision-Language-Action Models

VLA-FAIL introduces a lightweight failure detection framework for vision-language-action models. It uses two detectors, last-layer Mahalanobis distance and action chunk consistency, and requires neither failure data nor expensive action sampling. The framework combines these detectors to achieve reliable, early failure detection across diverse tasks, outperforming baseline methods in both accuracy and efficiency.
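As a rough illustration of the Mahalanobis-distance idea mentioned above, the sketch below fits a Gaussian to feature vectors from successful rollouts only and flags a new vector whose distance exceeds a threshold calibrated on that same nominal data. This is a generic sketch, not the paper's implementation: the 2-D toy features, the `ridge` regularizer, the max-based threshold, and all function names are assumptions made for illustration.

```python
def fit_gaussian(rows, ridge=1e-6):
    """Estimate mean and sample covariance from nominal (success-only) features."""
    n, d = len(rows), len(rows[0])
    mu = [sum(r[j] for r in rows) / n for j in range(d)]
    cov = [[sum((r[i] - mu[i]) * (r[j] - mu[j]) for r in rows) / (n - 1)
            for j in range(d)] for i in range(d)]
    for i in range(d):
        cov[i][i] += ridge  # small ridge term keeps the matrix invertible (assumption)
    return mu, cov


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def mahalanobis(x, mu, cov):
    """sqrt((x - mu)^T cov^{-1} (x - mu)), computed without forming the inverse."""
    diff = [xi - mi for xi, mi in zip(x, mu)]
    z = solve(cov, diff)
    quad = sum(d * zi for d, zi in zip(diff, z))
    return max(0.0, quad) ** 0.5  # clamp guards against tiny negative rounding error


# Toy stand-ins for last-layer features collected from successful episodes.
nominal = [[1.0, 2.0], [1.2, 1.9], [0.9, 2.1], [1.1, 2.2], [0.8, 1.8], [1.0, 2.0]]
mu, cov = fit_gaussian(nominal)

# Threshold from nominal data only; a high quantile would be a common alternative.
threshold = max(mahalanobis(r, mu, cov) for r in nominal)


def is_failure(x):
    return mahalanobis(x, mu, cov) > threshold
```

Because the threshold comes from nominal data alone, no failure examples are needed to calibrate the detector, which matches the property the summary highlights.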

arXiv cs.LG · 1d ago

Deep Learning Fuses Satellite Data with Meteorological Features for Soil Moisture Estimation

A study validates a Cross-Correlation Function method to identify optimal temporal and depth lags between meteorological variables and soil moisture. Using satellite and meteorological data across seven agricultural plots in southeastern Spain, deep learning models achieved significant improvements: a per-pixel CNN reached R² = 0.877, while a CNN-LSTM hybrid achieved the highest overall performance with R² = 0.930. Subsurface depth information and meteorological features substantially enhanced estimation accuracy.
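To make the lag-selection idea concrete, the sketch below computes the Pearson correlation between a driver series and a response series shifted by each candidate lag, then picks the lag with the largest absolute correlation. It is a minimal illustration of a cross-correlation function over temporal lags only; the toy rainfall and moisture series, the function names, and the selection rule are assumptions, not the study's actual procedure or data.

```python
from statistics import mean


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / (va * vb) ** 0.5


def best_lag(driver, response, max_lag):
    """Correlate driver[t] with response[t + lag] for lag = 0..max_lag."""
    scores = {}
    for lag in range(max_lag + 1):
        a = driver[: len(driver) - lag]  # drop the tail that has no shifted partner
        b = response[lag:]
        scores[lag] = pearson(a, b)
    best = max(scores, key=lambda k: abs(scores[k]))
    return best, scores


# Hypothetical toy data: moisture echoes rainfall two time steps later.
rain = [0, 5, 0, 0, 8, 0, 0, 0, 3, 0, 0, 6, 0, 0, 0, 0]
moisture = [0, 0] + rain[:-2]

lag, scores = best_lag(rain, moisture, max_lag=5)
print(lag)
```

The same loop could be repeated per sensor depth to obtain a lag for each depth, which is one plausible reading of the "temporal and depth lags" mentioned in the summary.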

arXiv cs.LG · 1d ago

Privacy-Preserving Federated Temporal Graph Learning for Cyber-Resilient IoMT

The paper introduces Federated TGCN-A2C, a privacy-preserving framework that achieves 99.48% and 99.61% test accuracy on the CICDDoS 2019 and TON-IoT benchmarks, respectively, outperforming Fed-Inforce-Fusion by 0.21 percentage points. The framework combines anomaly detection, digital twin-based scoring, adaptive action selection, and an enhanced honeypot layer. All major attack classes achieve F1 scores above 0.92 and 0.94 on the two benchmarks, respectively, and post-hoc explainability is provided via SHAP, LIME, Grad-CAM, and counterfactual analysis.