Reasoning models — korshunov.ai

Social World Model for Lifelong Social Intelligence

The Social World Model decomposes social interaction into five dimensions to enable closed-loop learning. It allows open-source models to sustainably improve and retain social capabilities, outperforming baselines and matching closed-source Gemini 3 Flash in key metrics without forgetting across difficulty levels.

arxiv arXiv cs.AI · 2d ago

Ramanujan Graph Rewiring Alleviates GNN Over-Squashing

Ramanujan Propagation uses Ramanujan graphs to reduce over-squashing in Graph Neural Networks by ensuring non-negative resistance curvature. The method preserves local connectivity while enabling efficient long-range information flow, outperforming nine state-of-the-art rewiring techniques.

arxiv arXiv cs.AI · 2d ago

Transformer Models Highly Sensitive to Noisy Data in Trajectory Prediction

A study finds that Transformer-based trajectory prediction models degrade significantly when object state data is noisy. Accuracy degrades by a factor of 1.3 under mild noise and by up to 3.9 under realistic high-noise conditions, which points to the need for noisier, real-world training data and for mitigation strategies.

arxiv arXiv cs.AI · 2d ago

LLMs Benchmarked for Web Vulnerability Detection

A study evaluates six LLMs on detecting real-world web vulnerabilities in WordPress plugins and finds that detection rates vary by model and prompt design. Claude Opus 4.6 achieved the highest detection rate at 63%, Qwen 3.5 reached only 35%, and no model consistently identified all baseline vulnerabilities across iterations.

arxiv arXiv cs.AI · 2d ago

Benchmark Evaluation of Small Language Models for Arabic NLP

A benchmark of 240 Arabic test items across eight domains and ten skills assesses twelve small language models in zero-shot settings. Gemma 3 (12B) achieved the highest overall score (4.548/5), followed by Aya and C4AI Command Arabic, with performance linked more to Arabic alignment and instruction-following than model size. Common failure modes include prompt leakage, hallucination, and weak task adherence.

arxiv arXiv cs.AI · 2d ago

Explainable AI Model for Career-Related Depression in University Students

A new Explainable AI framework uses structured behavioral data and facial emotion features to detect early signs of career-related depression and anxiety in university students. The model, evaluated on Pakistani student data, achieves an F1-score of 89.12% and identifies key markers like avoidance of direct gaze and social withdrawal, aligning with psychological theory.

arxiv arXiv cs.AI · 2d ago

Decoupling Declarative and Procedural Knowledge in Vision-Language-Action Models

w$^{2}$VLA introduces a modular vision-language-action model that decouples declarative and procedural knowledge. By restructuring information flow, it enables robust behavior cloning and zero-shot skill transfer to novel, dissimilar objects.

arxiv arXiv cs.AI · 2d ago

Memory-Efficient Graph Filtering for Scalable Collaborative Filtering

Mem-GF introduces a memory-efficient graph filtering method that approximates polynomial graph filters using Krylov subspaces, avoiding storage of the full item similarity graph. It achieves up to 5.74× lower memory usage and 4.38× faster runtime while outperforming state-of-the-art methods in accuracy and scaling to datasets with tens of millions of interactions.
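
The memory argument in the summary above can be illustrated with a minimal sketch. It assumes the item similarity graph is `S = R^T R` for a user-item interaction matrix `R`, and it evaluates a polynomial filter `sum_k c_k S^k x` using only products with `R`, so `S` is never stored. The vectors `x, Sx, S^2 x, ...` it builds are the ones that span a Krylov subspace. The function names, the unnormalized `S`, and the coefficients are illustrative assumptions, not the Mem-GF algorithm itself.

```python
def matvec(R, x):
    """Compute R @ x for a dense list-of-rows matrix R (users x items)."""
    return [sum(r * v for r, v in zip(row, x)) for row in R]


def rmatvec(R, y):
    """Compute R^T @ y without building the transpose."""
    out = [0.0] * len(R[0])
    for row, weight in zip(R, y):
        for j, r in enumerate(row):
            out[j] += r * weight
    return out


def apply_similarity(R, x):
    """Compute S @ x with S = R^T R, without ever forming the item-item matrix S."""
    return rmatvec(R, matvec(R, x))


def polynomial_filter(R, x, coeffs):
    """Evaluate sum_k coeffs[k] * S^k @ x using only matrix-vector products."""
    result = [0.0] * len(x)
    power = [float(v) for v in x]  # holds S^k @ x, starting at k = 0
    for k, c in enumerate(coeffs):
        result = [acc + c * p for acc, p in zip(result, power)]
        if k + 1 < len(coeffs):
            power = apply_similarity(R, power)
    return result


# Toy data: 2 users, 3 items (hypothetical interactions).
R = [[1, 0, 1],
     [0, 1, 1]]
filtered = polynomial_filter(R, [1, 0, 0], coeffs=[1.0, 0.5, 0.25])
```

Peak memory here is a few vectors of length `n_items` plus `R` itself, rather than the `n_items x n_items` similarity matrix.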

arxiv arXiv cs.AI · 2d ago

Zero-shot Procedural Mistake Detection with VLMs

A unified zero-shot framework, ZeProM, uses a pre-trained Video-Language Model to jointly perform procedural mistake detection and temporal action segmentation. It achieves improvements of up to 4.4 points in EDA and 2.0 points in F1@.5 on EgoPER tasks, matching or exceeding supervised methods without task-specific training.

media r/LocalLLaMA · 2d ago

LLM Medical Scribing Benchmark: Omissions Outnumber Hallucinations

A benchmark of 8 LLMs on 300 synthetic doctor-patient dialogues found 12 high-impact hallucinations and 520 clinically relevant omissions, making omissions far more common than hallucinations. DeepSeek excelled in prose quality and cost but missed many safety facts, while Claude Opus had the fewest omissions but poorer prose quality.

media r/LocalLLaMA · 2d ago

VibeThinker: 3B-parameter model beats Opus 4.5 in reasoning

VibeThinker, a 3-billion-parameter language model, outperforms Opus 4.5 in reasoning tasks using a novel SFT+GRPO training approach. The model was introduced in a paper available on arXiv, with details shared in a Reddit post.

media r/LocalLLaMA · 2d ago

Baidu Releases One-shot Long-horizon Parsing

Baidu has introduced a new parsing model called One-shot Long-horizon Parsing. The model enables efficient, long-range understanding of text with minimal training data, as demonstrated in a GitHub repository.

lab OpenAI News · 2d ago

GPT-5 Pro helps solve 3-year-old immunology mystery

GPT-5 Pro provided key insights into T cell behavior, resolving a 3-year-old immunology puzzle. The discovery may advance research in cancer and autoimmune diseases.

media r/LocalLLaMA · 2d ago

Best local models for reasoning in agentic AI

The creator of EverFern asks which local models work best for agentic workflows and browser/computer use. They note that model intelligence is rarely the bottleneck, with reliability and recovery systems being more critical than model choice.

media r/LocalLLaMA · 2d ago

Human Evaluation Shows GLM-5.2 Competes with Top Models

A human evaluation on Design Arena's leaderboard shows GLM-5.2 performing nearly as well as Fable 5 in game development tasks, placing just one step below it. The model, released with open weights under the MIT license, is assessed as equivalent in capability to the best available Claude models, suggesting that standardized benchmarks may no longer accurately reflect real-world performance.

media r/LocalLLaMA · 2d ago

SFT or RL-first for Qwen 3.5 Tool Agent Training?

A user asks whether supervised fine-tuning (SFT) followed by reinforcement learning (RL) is still recommended for training Qwen 3.5 4B or 9B agents for multi-tool use, or if RL-only approaches yield better results. The post also seeks guidance on reward design and handling parallel tool execution in agent workflows.

arxiv arXiv cs.CL · 2d ago

Group-Graph Policy Optimization for Long-Horizon Agentic RL

Group-Graph Policy Optimization (G2PO) introduces a graph-based approach to enhance long-horizon agentic reinforcement learning by transforming interaction trajectories into state-transition graphs. It enables group-aggregated state-value estimation and edge-centric advantage calculation, improving credit assignment and reducing variance, and achieves up to 22.2% success rate improvement over GRPO on WebShop, ALFWorld, and AppWorld benchmarks.
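
As a rough illustration of the idea described above, the sketch below merges a group of trajectories into a state-transition graph, estimates each state's value as the mean return of the trajectories that visit it, and scores each edge by the value difference between its endpoints. The summary does not give G2PO's formulas, so these particular estimators, the function names, and the toy states are all assumptions.

```python
from collections import defaultdict


def state_values(trajectories):
    """Group-aggregated value: mean return over trajectories visiting each state."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for states, ret in trajectories:
        for state in set(states):  # count each trajectory once per state
            totals[state] += ret
            counts[state] += 1
    return {state: totals[state] / counts[state] for state in totals}


def edge_advantages(trajectories):
    """Edge-centric advantage: V(next_state) - V(state) for every observed edge."""
    values = state_values(trajectories)
    advantages = {}
    for states, _ in trajectories:
        for state, next_state in zip(states, states[1:]):
            advantages[(state, next_state)] = values[next_state] - values[state]
    return advantages


# Hypothetical group of three rollouts: (visited states, final return).
group = [
    (["start", "search", "buy"], 1.0),
    (["start", "search", "quit"], 0.0),
    (["start", "quit"], 0.0),
]
values = state_values(group)
advantages = edge_advantages(group)
```

Because shared states pool evidence from every rollout that passes through them, two trajectories that diverge at `search` receive different credit for the diverging edge rather than one trajectory-level score.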

arxiv arXiv cs.CL · 2d ago

Unlimited OCR: Human-Like Parsing with Constant Memory

Unlimited OCR introduces Reference Sliding Window Attention (R-SWA) to emulate human working memory, enabling long-document transcription without growing memory usage. By replacing decoder attention layers in DeepSeek OCR, it maintains a constant KV cache and achieves full document processing in a single forward pass under 32K token limits. R-SWA is also applicable to ASR and translation tasks.
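
The constant-memory property mentioned above can be shown with a plain sliding-window cache. This is a generic sketch, not R-SWA: the summary does not say how the "reference" part works, so the class below only demonstrates that a fixed window keeps the number of cached key/value entries bounded regardless of sequence length. The class name and window size are hypothetical.

```python
from collections import deque


class SlidingWindowCache:
    """Keeps only the most recent `window` key/value entries."""

    def __init__(self, window):
        # A deque with maxlen evicts the oldest entry automatically on append.
        self.entries = deque(maxlen=window)

    def append(self, key, value):
        self.entries.append((key, value))

    def keys(self):
        return [key for key, _ in self.entries]

    def __len__(self):
        return len(self.entries)


cache = SlidingWindowCache(window=4)
for position in range(100):
    # Stand-ins for per-token key/value tensors.
    cache.append(position, f"value-{position}")
```

After 100 appended positions the cache still holds four entries, which is the behavior a constant KV cache relies on.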

arxiv arXiv cs.CL · 2d ago

Dual-Track Framework for Template-Constrained LaTeX Conversion

A new Dual-Track Framework decouples template formatting from document processing by using an offline track to extract template constraints into a reusable manifest and an online track with a hybrid pipeline. It limits LLM use to reasoning tasks like metadata and bibliographic handling, while applying rule-based engines for deterministic operations, improving structural fidelity, layout compliance, and compilation success over baseline methods.

arxiv arXiv cs.CL · 2d ago

Self-Evolution of Tool-Calling Agents via Divergence-Point Preference Learning

ToolGraph enhances multi-turn tool-using agents by integrating schema topology, transition weights, and history-aware controls. Training with DPO on 161 divergence-point preference pairs improves performance: ToolGraph+DPO achieves a 16.8% relative reward gain over baseline, especially in airline and retail tasks, with reward positivity emerging as the key diagnostic signal.
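
For readers unfamiliar with DPO, the sketch below computes the standard DPO loss for a single preference pair from sequence log-probabilities under the policy and a frozen reference model. The input values and `beta=0.1` are illustrative assumptions; the summary does not state the hyperparameters used for ToolGraph, nor how its divergence-point pairs are constructed.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin)).

    All four inputs are log-probabilities of full responses. The margin is how
    much more the policy prefers the chosen response than the reference does.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    z = beta * margin
    # Numerically stable -log(sigmoid(z)) = softplus(-z).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Policy identical to the reference: no preference signal yet.
neutral = dpo_loss(-5.0, -5.0, -5.0, -5.0)
# Policy has shifted probability toward the chosen response.
improved = dpo_loss(-1.0, -9.0, -5.0, -5.0)
```

When the policy matches the reference the loss equals `log 2`, and it falls as the policy raises the chosen response relative to the rejected one.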