Reasoning models — korshunov.ai

Zero-shot Procedural Mistake Detection with VLMs

ZeProM, a unified zero-shot framework, uses a pre-trained Video-Language Model to jointly perform procedural mistake detection and temporal action segmentation. It improves EDA by up to 4.4 points and F1@.5 by up to 2.0 points on EgoPER tasks, matching or exceeding supervised methods without task-specific training.

media r/LocalLLaMA · 1d ago

LLM Medical Scribing Benchmark: Omissions Outnumber Hallucinations

A benchmark of 8 LLMs on 300 synthetic doctor-patient dialogues found 12 high-impact hallucinations and 520 clinically relevant omissions, making omissions far more common than hallucinations. DeepSeek excelled in prose quality and cost but missed many safety facts, while Claude Opus had the fewest omissions but poorer prose quality.

media r/LocalLLaMA · 1d ago

VibeThinker: 3B-parameter model beats Opus 4.5 in reasoning

VibeThinker, a 3-billion-parameter language model, outperforms Opus 4.5 in reasoning tasks using a novel SFT+GRPO training approach. The model was introduced in a paper available on arXiv, with details shared in a Reddit post.

media r/LocalLLaMA · 1d ago

Baidu Releases One-shot Long-horizon Parsing

Baidu has introduced a new parsing model called One-shot Long-horizon Parsing. The model enables efficient, long-range understanding of text with minimal training data, as demonstrated in a GitHub repository.

lab OpenAI News · 1d ago

GPT-5 Pro helps solve 3-year-old immunology mystery

GPT-5 Pro provided key insights into T cell behavior, resolving a 3-year-old immunology puzzle. The discovery may advance research in cancer and autoimmune diseases.

media r/LocalLLaMA · 2d ago

Best local models for reasoning in agentic AI

The creator of EverFern asks which local models work best for agentic workflows and browser/computer use. They note that model intelligence is rarely the bottleneck, with reliability and recovery systems being more critical than model choice.

media r/LocalLLaMA · 2d ago

Human Evaluation Shows GLM-5.2 Competes with Top Models

A human evaluation on Design Arena's leaderboard shows GLM-5.2 performing nearly as well as Fable 5 in game development tasks, placing just one step below it. The model, released with open weights under an MIT license, is assessed as equivalent in capability to the best available Claude models, suggesting that standardized benchmarks may no longer accurately reflect real-world performance.

media r/LocalLLaMA · 2d ago

SFT or RL-first for Qwen 3.5 Tool Agent Training?

A user asks whether supervised fine-tuning (SFT) followed by reinforcement learning (RL) is still recommended for training Qwen 3.5 4B or 9B agents for multi-tool use, or if RL-only approaches yield better results. The post also seeks guidance on reward design and handling parallel tool execution in agent workflows.

arxiv arXiv cs.CL · 2d ago

Group-Graph Policy Optimization for Long-Horizon Agentic RL

Group-Graph Policy Optimization (G2PO) introduces a graph-based approach to enhance long-horizon agentic reinforcement learning by transforming interaction trajectories into state-transition graphs. It enables group-aggregated state-value estimation and edge-centric advantage calculation, improving credit assignment and reducing variance, and achieves up to 22.2% success rate improvement over GRPO on WebShop, ALFWorld, and AppWorld benchmarks.
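
The graph idea in this summary can be illustrated with a toy sketch. This is not the paper's algorithm: it assumes each trajectory is a list of hashable states plus one scalar return, estimates a state's value as the mean return of every group trajectory that visits it, and defines an edge's advantage as the value difference across it. The function name `edge_advantages` and the input layout are hypothetical.

```python
from collections import defaultdict


def edge_advantages(trajectories):
    """Toy group-graph credit assignment (an illustrative assumption, not G2PO itself).

    trajectories: list of (states, episode_return) pairs, where states is a
    list of hashable state identifiers visited in order.
    Returns (value, advantage): value maps state -> mean return of the
    trajectories that visit it; advantage maps (state, next_state) edges to
    value[next_state] - value[state].
    """
    returns_by_state = defaultdict(list)
    for states, episode_return in trajectories:
        # Count each trajectory once per state, even if the state repeats.
        for state in set(states):
            returns_by_state[state].append(episode_return)

    value = {s: sum(rs) / len(rs) for s, rs in returns_by_state.items()}

    advantage = {}
    for states, _ in trajectories:
        for state, next_state in zip(states, states[1:]):
            advantage[(state, next_state)] = value[next_state] - value[state]
    return value, advantage


group = [
    (["start", "a", "goal"], 1.0),
    (["start", "b", "fail"], 0.0),
    (["start", "a", "fail"], 0.0),
]
value, advantage = edge_advantages(group)
```

Because states shared across trajectories pool their returns, an edge such as `("a", "goal")` receives credit relative to everything the group observed at `"a"`, rather than a single trajectory-level score.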

arxiv arXiv cs.CL · 2d ago

Unlimited OCR: Human-Like Parsing with Constant Memory

Unlimited OCR introduces Reference Sliding Window Attention (R-SWA) to emulate human working memory, enabling long-document transcription without growing memory usage. By replacing decoder attention layers in DeepSeek OCR, it maintains a constant KV cache and achieves full document processing in a single forward pass under 32K token limits. R-SWA is also applicable to ASR and translation tasks.

arxiv arXiv cs.CL · 2d ago

Dual-Track Framework for Template-Constrained LaTeX Conversion

A new Dual-Track Framework decouples template formatting from document processing by using an offline track to extract template constraints into a reusable manifest and an online track with a hybrid pipeline. It limits LLM use to reasoning tasks like metadata and bibliographic handling, while applying rule-based engines for deterministic operations, improving structural fidelity, layout compliance, and compilation success over baseline methods.

arxiv arXiv cs.CL · 2d ago

Self-Evolution of Tool-Calling Agents via Divergence-Point Preference Learning

ToolGraph enhances multi-turn tool-using agents by integrating schema topology, transition weights, and history-aware controls. Training with DPO on 161 divergence-point preference pairs improves performance: ToolGraph+DPO achieves a 16.8% relative reward gain over baseline, especially in airline and retail tasks, with reward positivity emerging as the key diagnostic signal.

arxiv arXiv cs.CL · 2d ago

PRIDE: Privileged Information-enhanced Distillation for Empathetic Dialogue Generation

PRIDE introduces a knowledge distillation method that transfers empathetic reasoning from large models to smaller ones using privileged information available only during training. It achieves competitive or superior performance on empathy-related tasks by leveraging structured prompts, multi-source attention, and dual-alignment loss.

media Hugging Face Forums · 2d ago

Coolest Theoretical AI Topics with Realistic AI System Basis

The discussion explores theoretical AI topics that have mathematical foundations and plausible implementation in current AI systems, such as large language models. Topics include reasoning chains, knowledge graphs, and probabilistic reasoning, all of which are grounded in formal math and show potential for real-world AI applications.

arxiv arXiv cs.CL · 2d ago

Language shapes historical credit in large language models

A study of 11 large language models across 21 disputed inventions shows that query language systematically influences which inventor is credited. Lower-status claimants appear more frequently when questions are phrased in their native language, while dominant Anglophone figures remain consistent. The findings suggest language acts as a switch that activates distinct national versions of history, indicating that LLMs function as systems of cultural memory.

arxiv arXiv cs.CL · 2d ago

DART: Training-Free Routing for Adaptive Thinking Budgets

DART enables hybrid reasoning models to route queries between direct answering and extended thinking without training data. It uses two no-think drafts to decide the response mode and estimates the thinking budget from the disagreement between the drafts. DART improves accuracy by up to 9.0 points in math and 22.5 points in code reasoning while reducing thinking tokens by 15-69% and 51-63%, respectively.
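
A minimal sketch of draft-based routing, under stated assumptions: the two drafts are plain strings, agreement means equality after trimming and lowercasing, and disagreement is one minus the `difflib` similarity ratio. The function name `route_query` and the `max_budget` and `min_budget` parameters are hypothetical; the paper's actual decision rule and budget estimator may differ.

```python
from difflib import SequenceMatcher


def route_query(draft_a, draft_b, max_budget=4096, min_budget=256):
    """Pick a response mode and thinking budget from two no-think drafts.

    Returns ("direct", 0) when the drafts agree, otherwise
    ("think", budget) with the budget scaled by how much they differ.
    """
    a = draft_a.strip().lower()
    b = draft_b.strip().lower()
    if a == b:
        # The drafts agree, so answer directly without extended thinking.
        return ("direct", 0)

    # Assumed disagreement measure: 1 - character-level similarity ratio.
    disagreement = 1.0 - SequenceMatcher(None, a, b).ratio()
    budget = max(min_budget, int(max_budget * disagreement))
    return ("think", budget)


agree = route_query("42", " 42 ")
differ = route_query("42", "17")
partial = route_query("x = 1", "x = 2")
```

Drafts that share no characters receive the full budget, while near-identical drafts fall back to the `min_budget` floor.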

arxiv arXiv cs.CL · 2d ago

Memory Contagion: Bias Propagation in Agent Memory

Researchers identify Memory Contagion, a phenomenon where evaluator bias propagates across time in agent memory. Even with perfect memory consolidation, bias spreads to future agents retrieving from the same memory store, with contamination detected as low as p=0.2. The effect varies by bias type: length bias is attenuated, while authority bias is amplified, indicating a bias-dependent interaction.

arxiv arXiv cs.CL · 2d ago

Task-Sensitive Analysis of Intrinsic Self-Correction

A study examines when intrinsic self-correction (SC) works by analyzing its performance across different task structures. It finds that SC yields consistent gains only when the task supports explicit constraint verification, complex reasoning revision, or strategy evaluation, so its benefit is task-dependent rather than universal.

arxiv arXiv cs.CL · 2d ago

CFPO: Counterfactual Policy Optimization for Multimodal Reasoning

CFPO introduces a cross-modal counterfactual enhancement mechanism to improve causal consistency between visual perception and textual reasoning in vision-language models. It achieves 3.17%-6.25% gains over standard RL baselines and 1.32%-2.13% over PAPO, without requiring external rewards or supervision.

arxiv arXiv cs.CL · 2d ago

Judgment-Grounded Expansion for Peer Review Generation

A new human-AI collaboration method called judgment-grounded expansion enables accountable peer review generation. The approach involves a reviewer providing an evaluative claim, which the system expands into review comment candidates through a structured generate-check-refine process. The study addresses scalable evaluation and candidate set curation, showing conformal prediction effectively balances candidate size and coverage.
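
The size-versus-coverage trade-off mentioned above can be sketched with generic split conformal prediction. This is a textbook construction and not necessarily the paper's procedure: it assumes every candidate has a nonconformity score (lower means more acceptable) and that a held-out calibration set of such scores is available. The names `conformal_threshold` and `select_candidates` are hypothetical.

```python
import math


def conformal_threshold(calibration_scores, alpha):
    """Split-conformal threshold targeting roughly 1 - alpha coverage.

    Uses the ceil((n + 1) * (1 - alpha))-th smallest calibration score.
    Returns infinity when the calibration set is too small for that level.
    """
    n = len(calibration_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")
    return sorted(calibration_scores)[k - 1]


def select_candidates(candidate_scores, threshold):
    """Keep candidates whose nonconformity score is within the threshold."""
    return [name for name, score in candidate_scores.items() if score <= threshold]


calibration = [0.9, 0.1, 0.5, 0.3, 0.7]
candidates = {"c1": 0.2, "c2": 0.6, "c3": 0.5}

loose = conformal_threshold(calibration, alpha=0.5)
kept = select_candidates(candidates, loose)
strict = conformal_threshold(calibration, alpha=0.1)
kept_all = select_candidates(candidates, strict)
```

A smaller `alpha` demands higher coverage and therefore a larger candidate set; with only five calibration scores, `alpha=0.1` cannot be certified, so every candidate is kept.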