Reasoning models
arXiv cs.AI · 8d ago

Clinician-Centered Pipeline for Ultrasound AI Annotation and Evaluation

A new pipeline enables clinicians to perform remote annotation and blinded evaluation of ultrasound AI models without local data downloads. It supports multi-rater participation, result aggregation, and automated statistical analysis, validated in a fetal ultrasound segmentation study with six raters of varying expertise. Results show moderate to strong agreement and a preference for later active learning models in blinded rankings.
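One simple way to aggregate blinded rankings from several raters is a mean rank per model. The sketch below is illustrative only: the summary does not say which aggregation or statistical tests the pipeline uses, and the function name `mean_ranks` and the model labels are hypothetical.

```python
from statistics import mean


def mean_ranks(rankings):
    """Average each model's rank across raters (1 = best).

    `rankings` holds one dict per rater, mapping a model label to the rank
    that rater assigned. Every rater is assumed to rank the same models.
    """
    if not rankings:
        raise ValueError("at least one rater is required")
    models = rankings[0].keys()
    return {m: mean(r[m] for r in rankings) for m in models}


# Hypothetical blinded rankings from two raters over three model versions.
example = [
    {"round_1": 3, "round_2": 2, "round_3": 1},
    {"round_1": 3, "round_2": 1, "round_3": 2},
]
aggregated = mean_ranks(example)
```

A lower mean rank means the raters preferred that model on average; ties such as the one between `round_2` and `round_3` here would need a separate tie-breaking rule.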

arXiv cs.AI · 8d ago

MAST Enables Selective Unlearning in RLVR-Induced Reasoning

MAST, a mechanism-guided unlearning method, achieves targeted forgetting of RLVR-induced reasoning with minimal collateral damage. On Qwen2.5-Math-1.5B and Qwen3-1.7B-Base, it significantly reduces MATH performance (45/150 to 37/150) while preserving GSM8K accuracy (+0.8 points) and MATH retention (-0.5 points). Results hold across seeds, objectives, and models, and show greater stability than full-parameter unlearning.

arXiv cs.AI · 8d ago

X+Slides: Benchmark for Audience-Conditioned Slide Generation

X+Slides introduces a benchmark that evaluates slide generation against the needs of a target audience. It uses 8,133 source-grounded probes across 113 topics and seven scenes to measure Audience Coverage, Domain-wise Coverage, Efficiency, and Correctness. Current systems recover only part of the audience-essential information: DeepPresenter reaches 0.714 Audience Coverage, SlideTailor 0.594, and a NotebookLM ablation 0.853. The results highlight the need for source-grounded evaluation.
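A probe-based coverage score can be read as the fraction of probes that a slide deck answers. The sketch below assumes exactly that reading; the benchmark's actual metric definitions are not given in this summary, and the function `coverage_scores`, the `(domain, answered)` record layout, and the sample data are hypothetical.

```python
from collections import defaultdict


def coverage_scores(probes):
    """Return (overall coverage, per-domain coverage) for one slide deck.

    `probes` is a list of (domain, answered) pairs, where `answered` is True
    when the deck contains the information the probe asks for.
    """
    if not probes:
        raise ValueError("at least one probe is required")
    hits = defaultdict(int)
    totals = defaultdict(int)
    for domain, answered in probes:
        totals[domain] += 1
        hits[domain] += int(answered)
    overall = sum(hits.values()) / len(probes)
    by_domain = {d: hits[d] / totals[d] for d in totals}
    return overall, by_domain


# Hypothetical probe outcomes for a single deck.
sample = [
    ("methods", True),
    ("methods", False),
    ("results", True),
    ("results", True),
]
overall, by_domain = coverage_scores(sample)
```

Reporting the per-domain breakdown next to the overall score shows whether a deck covers some domains well while skipping others.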

arXiv cs.AI · 8d ago

Trade-offs in Medical LLM Adaptation: French QA Study

A study compares continual pretraining (CPT), supervised fine-tuning (SFT), and their combination for French medical QA. CPT+SFT performs best in multiple-choice QA, though gains over SFT are small and often insignificant, making SFT a cost-effective default. For open-ended QA, CPT improves metrics while SFT degrades quality, with instruction tuning and CPT+SFT favored by LLM-based evaluations. Cross-lingual results show effective transfer from French to English benchmarks.

arXiv cs.AI · 8d ago

NeSyCat Torch: Differentiable Tensor Implementation for Neurosymbolic Learning

NeSyCat Torch provides a differentiable tensor implementation of categorical semantics for neurosymbolic learning, unifying classical, fuzzy, probabilistic, and neural systems under a single inductive truth definition. It outperforms LTN and DeepProbLog in speed and accuracy on MNIST addition, matching DeepStochLog's accuracy while operating within a uniform framework extensible to continuous probability via monad instantiation.

arXiv cs.CL · 8d ago

RECOM: Validity-Discrimination Tradeoff in Reddit QA Metrics

RECOM evaluates 15,000 r/AskReddit questions with authentic community replies posted after model training. It shows that no automatic metric simultaneously achieves strong validity and discriminative power; BERTScore ranks models only weakly even when length is controlled. The tradeoff arises from representation design, not model differences, which motivates reporting both validity and discrimination with random-baseline floors.

arXiv cs.CL · 8d ago

DreamReasoner-8B: Block-Size Curriculum Learning for Diffusion Reasoning

DreamReasoner-8B is an open-source block diffusion model that demonstrates strong long-chain-of-thought reasoning. A systematic study shows that small training block sizes preserve reasoning effectiveness, while large block sizes degrade performance. Block-size curriculum learning gradually moves training from fine to coarse blocks, enabling robust and generalizable reasoning across inference settings, with results competitive with Qwen3-8B on mathematical and code benchmarks.
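A fine-to-coarse curriculum can be pictured as a schedule that maps each training step to a block size. The sketch below is a minimal illustration that assumes equal-length stages; the block sizes, the stage lengths, and the function name `block_size_at` are hypothetical and are not taken from the paper.

```python
def block_size_at(step, total_steps, sizes=(4, 8, 16, 32)):
    """Return the training block size for `step`, moving from fine to coarse.

    Training is split into len(sizes) equal stages; stage i uses sizes[i].
    The default sizes are placeholders, not values from the paper.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp so out-of-range steps map to the first or last stage.
    step = min(max(step, 0), total_steps - 1)
    stage = step * len(sizes) // total_steps
    return sizes[stage]


# Block sizes at the start, at a stage boundary, and at the end of training.
schedule = [block_size_at(s, 100) for s in (0, 24, 25, 99)]
```

With 100 steps and four sizes, the schedule switches every 25 steps, so early training sees only small blocks and the largest block size appears in the final stage.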