Reasoning models
arXiv cs.AI · 8d ago

AdsMind: Physics-Grounded Multi-Agent System for Adsorption Discovery

AdsMind is a closed-loop multi-agent system that uses machine learning force fields and feedback to correct errors in adsorption configuration searches on catalyst surfaces. It achieves success rates of 100% and 98.8% on the AA20 and OCD-GMAE62 benchmarks, respectively, reduces energy dispersion 14-fold compared to baselines, and maintains correct adsorption-energy signs in DFT validation, outperforming open-loop LLM agents.

arXiv cs.LG · 8d ago

LegalHalluLens: Auditing Hallucinations in Legal AI

LegalHalluLens introduces a framework to audit AI hallucinations in legal contexts by analyzing typed hallucination profiles across four claim categories. It reveals a 38-40 point gap between obligation/numeric and temporal claims, and shows two systems with identical 52% hallucination rates can have opposite risk directions. The framework uses a Risk Direction Index and calibrated debate pipelines to reduce fabricated detections by 45%, offering actionable diagnostics for trustworthy legal AI deployment.

arXiv cs.LG · 8d ago

Recursive Masked Diffusion Models Introduce New Scaling Axis

Recursive Masked Diffusion Models (R-MDMs) introduce recursive depth as a third scaling axis by reapplying a denoising transformer within each diffusion step. This recursion enables iterative output refinement without increasing parameter count, achieving performance comparable to non-recursive models with up to L times more parameters, where L is the number of iterations. R-MDMs also reduce inference compute by partially replacing denoising steps with recursive refinement.
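The idea of recursive depth can be illustrated with a toy sketch: the same function is applied repeatedly inside one step, so compute grows with the number of iterations while the parameter count stays fixed. The names `recursive_denoise_step` and `toy_denoiser` are hypothetical, and the toy update rule is a stand-in chosen for illustration, not the R-MDM transformer or its training setup.

```python
def recursive_denoise_step(x, denoiser, depth):
    """Apply the same denoiser `depth` times within a single diffusion step.

    The denoiser (and therefore its parameters) is shared across iterations,
    so increasing `depth` adds compute but no new parameters.
    """
    h = x
    for _ in range(depth):
        h = denoiser(h)
    return h


def toy_denoiser(h):
    # Hypothetical stand-in for a denoising network: moves each value
    # halfway toward a fixed "clean" target of 1.0.
    return [v + 0.5 * (1.0 - v) for v in h]


x = [0.0, 2.0]
out1 = recursive_denoise_step(x, toy_denoiser, depth=1)
out3 = recursive_denoise_step(x, toy_denoiser, depth=3)
print(out1)  # one refinement pass
print(out3)  # three passes with the same function, closer to the target
```

In this toy setting, each extra pass halves the remaining distance to the target, which mirrors the summary's point that refinement comes from reuse rather than from a larger model.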

arXiv cs.LG · 8d ago

Baseline Evaluation of Open-Source LLMs for Multi-Label ATT&CK Classification

A ground-truth dataset of 2,076 human-annotated sentences from 83 complex CTI reports was constructed and mapped to 114 ATT&CK techniques with κ = 0.68 inter-annotator agreement. Seven open-source LLMs ranging from 8B to 236B parameters were evaluated, achieving a maximum micro-averaged F1 score of 0.22. Parameter size showed a statistically significant positive correlation with F1 score, while prompt strategy and temperature did not yield significant improvements, indicating current open-source LLMs are insufficient for production-grade ATT&CK classification.
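For readers unfamiliar with the metric, micro-averaged F1 in a multi-label setting pools true positives, false positives, and false negatives over all sentences and labels before computing precision and recall. The sketch below shows one standard way to compute it; the function name `micro_f1` and the three example sentences are made up for illustration and are not taken from the paper's dataset.

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 over per-sentence label sets.

    gold, pred: lists of sets, one set of technique IDs per sentence.
    Counts are pooled across all sentences before computing the score.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # labels predicted and correct
        fp += len(p - g)   # labels predicted but not in gold
        fn += len(g - p)   # gold labels that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Illustrative data only (not from the paper).
gold = [{"T1059", "T1105"}, {"T1566"}, {"T1027"}]
pred = [{"T1059"}, {"T1566", "T1204"}, set()]
score = micro_f1(gold, pred)
print(round(score, 3))  # 2 TP, 1 FP, 2 FN -> 4/7, about 0.571
```

Because counts are pooled, frequent techniques dominate the score, which is worth keeping in mind when reading a single headline number such as 0.22 across 114 labels.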

arXiv cs.LG · 8d ago

NoiseTilt: Noise-Tilted Reverse Kernels for Diffusion Reward Alignment

NoiseTilt introduces NTRK, a reward-guided diffusion sampler that injects reward gradients via the noise term without altering the reverse kernel. By using a whitening operator, NTRK safely biases noise toward high reward, preserving sample quality while maintaining strong guidance. On aesthetic generation, NTRK achieves superior reward performance with 25 NFEs, reducing compute by 20× compared to state-of-the-art baselines.

arXiv cs.LG · 8d ago

Compositional Generalization in Language Model Reasoning

A hierarchical latent selection model shows that supervised fine-tuning and reinforcement learning work together to enable compositional generalization in language models. SFT provides raw module materials, while RL identifies and recombines atomic modules from compound traces to solve new problems. Training on compound traces leads to stronger generalization than isolated module training, and an effective protocol is found where SFT ensures module coverage and RL drives exploration of novel compositions.