Reasoning models — korshunov.ai

Clinician-Centered Pipeline for Ultrasound AI Annotation and Evaluation

A new pipeline enables clinicians to perform remote annotation and blinded evaluation of ultrasound AI models without local data downloads. It supports multi-rater participation, result aggregation, and automated statistical analysis, validated in a fetal ultrasound segmentation study with six raters of varying expertise. Results show moderate to strong agreement and a preference for later active learning models in blinded rankings.

arxiv arXiv cs.AI · 8d ago

LLM-as-Interface, ML-as-Predictor for Pediatric Appendicitis

ClaMPAPP, a hybrid system, uses an LLM to extract structured clinical features from free-text notes and passes them to an XGBoost classifier for diagnosis. It outperformed end-to-end LLMs in both internal and external validation, with better diagnostic performance and fewer missed cases, demonstrating superior stability and safety in pediatric appendicitis triage.

arxiv arXiv cs.AI · 8d ago

Decision-Focused RL for EV Charging with Unknown Departure Times

A decision-focused RL framework jointly trains a forecaster and charging controller to handle unknown EV departure times. The method improves charging decisions by up to 14% in total reward and reduces unsupplied energy by 55% compared to standard RL without forecasting.

arxiv arXiv cs.AI · 8d ago

MAST Enables Selective Unlearning in RLVR-Induced Reasoning

MAST, a mechanism-guided unlearning method, achieves targeted forgetting of RLVR-induced reasoning with minimal collateral damage. On Qwen2.5-Math-1.5B and Qwen3-1.7B-Base, it significantly reduces MATH performance (45/150 to 37/150) while preserving GSM8K accuracy (+0.8 points) and MATH retention (-0.5 points). Results hold across seeds, objectives, and models, showing superior stability over full-parameter unlearning.

arxiv arXiv cs.AI · 8d ago

STARE: Surprisal-Guided Token-Level Advantage Reweighting for Policy Entropy Stability

STARE addresses policy entropy collapse in GRPO-based reinforcement learning by identifying entropy-critical token subsets via surprisal quantiles and reweighting their advantages. It maintains stable policy entropy across model scales and tasks, outperforming DAPO and other baselines by 4%-8% on AIME24 and AIME25, with consistent exploration-exploitation balance.
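The summary does not spell out the reweighting rule, so the following is only a minimal sketch of the general idea: compute per-token surprisal, pick a quantile threshold, and rescale the advantages of tokens above it. The quantile `q`, the factor `scale`, the direction of the reweighting, and all function names are assumptions for illustration, not the paper's settings.

```python
import math

def surprisal(logprobs):
    """Per-token surprisal, -log p(token), from sampled-token log-probabilities."""
    return [-lp for lp in logprobs]

def quantile(values, q):
    """Linear-interpolation quantile of a non-empty list, with 0 <= q <= 1."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return ordered[lo] * (1 - frac) + ordered[hi] * frac

def reweight_advantages(logprobs, advantages, q=0.8, scale=0.5):
    """Rescale advantages of tokens whose surprisal exceeds the q-quantile.

    Assumption: high-surprisal tokens are the entropy-critical subset and are
    down-weighted by `scale`; the actual method may select or weight differently.
    """
    s = surprisal(logprobs)
    threshold = quantile(s, q)
    return [a * scale if si > threshold else a for a, si in zip(advantages, s)]

# GRPO-style input: one sequence-level advantage broadcast to every token.
token_logprobs = [-0.1, -0.2, -3.0, -0.05, -0.3]
token_advantages = [1.0] * 5
print(reweight_advantages(token_logprobs, token_advantages))
```

In this toy input only the third token (surprisal 3.0) lies above the 0.8 quantile, so it is the only one whose advantage changes.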

arxiv arXiv cs.AI · 8d ago

TxBench-PP: AI Agent Benchmark in Preclinical Pharmacology

TxBench-PP is a verifiable benchmark for small-molecule preclinical pharmacology, testing AI agents' ability to derive accurate conclusions from real-world assay data. Across 16 model configurations, no system reliably passed all evaluations; the best-performing setup (Claude Opus 4.8 / Pi) achieved a 59.3% success rate on 300 endpoint attempts.

arxiv arXiv cs.AI · 8d ago

OneCanvas: 3D Scene Understanding via Panoramic Reprojection

OneCanvas enables 3D scene understanding in Vision-Language Models by aggregating patch features onto a panoramic canvas using 3D world coordinates. It achieves state-of-the-art results on SQA3D and VSI-Bench, with strong generalization on SPBench, using significantly less training compute than prior methods.
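As a rough illustration of placing patch features on a panoramic canvas by 3D position, the sketch below projects world points onto an equirectangular grid around a reference point and mean-pools features that fall into the same cell. The equirectangular layout, the axis convention (z up), the mean pooling, and every name here are assumptions; the summary does not describe the actual projection or aggregation.

```python
import math
from collections import defaultdict

def to_canvas_cell(point, center, width, height):
    """Map a 3D world point to an equirectangular (row, col) cell around `center`."""
    dx, dy, dz = (p - c for p, c in zip(point, center))
    azimuth = math.atan2(dy, dx)                    # angle in [-pi, pi]
    elevation = math.atan2(dz, math.hypot(dx, dy))  # angle in [-pi/2, pi/2]
    col = int((azimuth + math.pi) / (2 * math.pi) * width) % width
    row = min(int((math.pi / 2 - elevation) / math.pi * height), height - 1)
    return row, col

def aggregate(patches, center, width, height):
    """Mean-pool feature vectors of patches that land in the same canvas cell."""
    buckets = defaultdict(list)
    for point, feature in patches:
        buckets[to_canvas_cell(point, center, width, height)].append(feature)
    return {
        cell: [sum(dim) / len(dim) for dim in zip(*feats)]
        for cell, feats in buckets.items()
    }

# Hypothetical patches: (world position, feature vector).
patches = [
    ((1.0, 0.0, 0.0), [1.0, 2.0]),
    ((2.0, 0.0, 0.0), [3.0, 4.0]),  # same viewing direction, farther away
    ((0.0, 0.0, 1.0), [5.0, 6.0]),  # straight up
]
canvas = aggregate(patches, center=(0.0, 0.0, 0.0), width=8, height=4)
```

The first two patches share a viewing direction, so they collapse into one cell; the third lands in the top row.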

arxiv arXiv cs.AI · 8d ago

X+Slides: Benchmark for Audience-Conditioned Slide Generation

X+Slides introduces a benchmark that evaluates slide generation against the needs of a target audience. It uses 8,133 source-grounded probes across 113 topics and seven scenes to measure Audience Coverage, Domain-wise Coverage, Efficiency, and Correctness. Current systems recover only part of the audience-essential information: DeepPresenter reaches 0.714 Audience Coverage, SlideTailor 0.594, and the NotebookLM ablation 0.853. The results highlight the need for source-grounded evaluation.

arxiv arXiv cs.AI · 8d ago

Trade-offs in Medical LLM Adaptation: French QA Study

A study compares continual pretraining (CPT), supervised fine-tuning (SFT), and their combination for French medical QA. CPT+SFT performs best in multiple-choice QA, though gains over SFT are small and often insignificant, making SFT a cost-effective default. For open-ended QA, CPT improves metrics while SFT degrades quality, with instruction tuning and CPT+SFT favored by LLM-based evaluations. Cross-lingual results show effective transfer from French to English benchmarks.

arxiv arXiv cs.AI · 8d ago

NeSyCat Torch: Differentiable Tensor Implementation for Neurosymbolic Learning

NeSyCat Torch provides a differentiable tensor implementation of categorical semantics for neurosymbolic learning, unifying classical, fuzzy, probabilistic, and neural systems under a single inductive truth definition. It outperforms LTN and DeepProbLog in speed and accuracy on MNIST addition, matching DeepStochLog's accuracy while operating within a uniform framework extensible to continuous probability via monad instantiation.

arxiv arXiv cs.AI · 8d ago

Reverse-Engineering Transformer Attention with Executable Programs

A new method uses program synthesis to generate Python programs that reproduce attention patterns in transformer models. Fewer than 1,000 such programs achieve over 75% intersection-over-union similarity on TinyStories, and replacing 25% of attention heads with these programs increases perplexity by only 16% while preserving performance on question-answering tasks.
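To make the intersection-over-union comparison concrete, here is a small sketch that scores one hand-written "program" against a toy attention matrix. The binarization threshold, the `previous_token_program` rule, and the matrix values are illustrative assumptions; the summary does not say how the method binarizes attention or what the synthesized programs look like.

```python
def previous_token_program(tokens):
    """Hypothetical program: every position attends to the token just before it."""
    return {(q, q - 1) for q in range(1, len(tokens))}

def binarize(attention, threshold=0.5):
    """Turn an attention matrix into the set of (query, key) pairs above threshold."""
    return {
        (q, k)
        for q, row in enumerate(attention)
        for k, weight in enumerate(row)
        if weight >= threshold
    }

def iou(predicted, observed):
    """Intersection over union of two sets of (query, key) pairs."""
    union = predicted | observed
    return len(predicted & observed) / len(union) if union else 1.0

tokens = ["The", "cat", "sat", "down"]
attention = [  # toy causal attention weights, one row per query position
    [1.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.1, 0.8, 0.1, 0.0],
    [0.6, 0.1, 0.2, 0.1],
]
score = iou(previous_token_program(tokens), binarize(attention))
print(score)  # → 0.4
```

Here the program matches two of the head's four strong edges and predicts one edge the head does not have, giving 2 shared pairs out of 5 in the union.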

arxiv arXiv cs.AI · 8d ago

Rubric-Conditioned Self-Distillation Framework

Rubric-Conditioned Self-Distillation introduces a framework that uses structured rubrics to provide fine-grained, token-level feedback during self-distillation of reasoning language models. By conditioning teacher models on rubric-level criteria, it enables more precise credit assignment than scalar rewards, outperforming GRPO and OPSD by 1.0 and 0.9 points on average across science reasoning benchmarks.

arxiv arXiv cs.AI · 8d ago

UBP2: Uncertainty-Balanced Preference Planning for Efficient Preference-based RL

UBP2 introduces a model-based method that actively explores environments by jointly reasoning over uncertainties in reward, dynamics, and value functions. It achieves superior sample efficiency in preference-based reinforcement learning, outperforming both model-free and non-optimistic model-based baselines on the Meta-World benchmark.

media r/LocalLLaMA · 8d ago

Benchmarking small LLMs on hard HTML data extraction

A user tested models from 2B to 35B parameters on 29 difficult HTML data extraction pages, finding that smaller models like gemma4 e2b and e4b outperform larger ones. Qwen3.6 27B led in performance, while all MoE models scored poorly, highlighting the importance of task-specific benchmarking.

arxiv arXiv cs.CL · 8d ago

Dango: A Strictly L1-Only LLM for SLA Research

Dango is a 1.8B-parameter LLM designed to study Japanese-to-English second language acquisition. It uses a filtering method to minimize English contamination in monolingual pretraining, preserving realistic L1 exposure. Fine-tuned on LLM-generated lessons, Dango produces human-like L2 outputs, outperforming unfiltered and standard multilingual models.

arxiv arXiv cs.CL · 8d ago

LLM-as-Interface, ML-as-Predictor for Pediatric Appendicitis

ClaMPAPP, a hybrid system, uses an LLM to extract structured clinical features from free-text notes and passes them to an XGBoost classifier for diagnosis. It outperformed end-to-end LLMs in both internal and external validation, with better stability and fewer missed appendicitis cases, demonstrating superior diagnostic performance and safety in pediatric triage.

arxiv arXiv cs.CL · 8d ago

RECOM: Validity-Discrimination Tradeoff in Reddit QA Metrics

RECOM evaluates 15,000 r/AskReddit questions with authentic community replies posted after model training. It shows no automatic metric simultaneously achieves strong validity and discriminative power, with BERTScore ranking models weakly even when length is controlled. The tradeoff arises from representation design, not model differences, and requires reporting both validity and discrimination with random-baseline floors.

arxiv arXiv cs.CL · 8d ago

DreamReasoner-8B: Block-Size Curriculum Learning for Diffusion Reasoning

DreamReasoner-8B is an open-source block diffusion model that demonstrates strong long-chain-of-thought reasoning. A systematic study shows that small training block sizes preserve reasoning effectiveness, while large sizes degrade performance. Block-size curriculum learning gradually transitions training from fine to coarse blocks, enabling robust and generalizable reasoning across inference settings, with results competitive with Qwen3-8B on mathematical and code benchmarks.
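A fine-to-coarse curriculum can be pictured as a schedule that maps the training step to a block size. The sketch below uses equal-length stages over a fixed list of sizes; the specific sizes, the stage lengths, and the function name are assumptions, since the summary gives no schedule details.

```python
def block_size_at(step, total_steps, sizes=(4, 8, 16, 32)):
    """Return the training block size for `step`, moving from fine to coarse.

    Assumption: training is split into len(sizes) equal stages; steps at or
    beyond `total_steps` stay on the coarsest size.
    """
    stage = min(step * len(sizes) // total_steps, len(sizes) - 1)
    return sizes[stage]

# Block size at a few points of a hypothetical 1,000-step run.
schedule = [block_size_at(s, 1000) for s in (0, 249, 250, 500, 999)]
print(schedule)  # → [4, 4, 8, 16, 32]
```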

arxiv arXiv cs.CL · 8d ago

Large Language Gibbs for Structured Probabilistic Inference

Large Language Gibbs uses LLM conditional distributions as transition operators for iterative variable resampling. This method enables probabilistically coherent structured inference by avoiding order-dependent biases and achieving a stationary distribution that balances local conditionals. It demonstrates practical efficacy in synthetic distributions, consistent reasoning, and Bayesian structure learning.
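The resampling loop can be sketched as an ordinary Gibbs sampler in which each conditional is supplied by a callable. Below, `stub_conditional` is a hypothetical stand-in for an LLM query over two binary variables, with fixed probabilities chosen so that the conditionals are mutually consistent; real LLM conditionals need not be, and the variable names and numbers are assumptions for illustration only.

```python
import random

def stub_conditional(variable, state):
    """Hypothetical stand-in for an LLM call: P(variable = 1 | the other variable)."""
    other = "wet" if variable == "rain" else "rain"
    return 0.8 if state[other] == 1 else 0.2

def gibbs(conditional, variables, sweeps, rng, burn_in=100):
    """Resample each variable in turn from its conditional; return post-burn-in states."""
    state = {v: 0 for v in variables}
    samples = []
    for step in range(sweeps + burn_in):
        for v in variables:
            state[v] = 1 if rng.random() < conditional(v, state) else 0
        if step >= burn_in:
            samples.append(dict(state))
    return samples

samples = gibbs(stub_conditional, ["rain", "wet"], sweeps=20000, rng=random.Random(0))
both = sum(s["rain"] == 1 and s["wet"] == 1 for s in samples) / len(samples)
print(f"P(rain=1, wet=1) is approximately {both:.3f}")
```

For these toy conditionals the matching joint distribution puts probability 0.4 on both variables being 1, so the empirical frequency should settle near that value.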