Reasoning models — korshunov.ai

Topic · Reasoning models

GPT-5 Pro helps solve 3-year-old immunology mystery

GPT-5 Pro provided key insights into T cell behavior, resolving a 3-year-old immunology puzzle. The discovery may advance research in cancer and autoimmune diseases.

lab OpenAI News · 3d ago

GPT-5.5 Instant Enhances ChatGPT's Health Responses

GPT-5.5 Instant improves ChatGPT's health and wellness responses through stronger reasoning, better context handling, clearer communication, and physician-informed evaluations.

media Hugging Face Forums · 2d ago

Buddy System: Rust entropy monitor with NER-gated uncertainty for tiered LLM inference

The Buddy System uses a Rust entropy monitor to detect per-token uncertainty in local Gemma 3 4B inference, routing only uncertain tokens to Sonnet via NER-gated span extraction and semantic retrieval. Benchmarks show it achieves 71.4% accuracy at $0.21, outperforming the Anthropic Advisor pattern (62.9% at $0.44) across seven Hugging Face datasets, with a key improvement on SQuAD v2 by routing source passage chunks to the cloud model.
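The entropy gate described above can be illustrated with a minimal Python sketch. It is not the project's Rust implementation: the function names, the threshold of 1.0 nats, and the toy probability vectors are all assumptions chosen for illustration. It only shows the general idea of flagging generation steps whose next-token distribution is too flat.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def uncertain_positions(step_probs, threshold=1.0):
    """Indices of generation steps whose entropy exceeds the threshold.

    The threshold is a hypothetical value; a real system would tune it.
    """
    return [i for i, probs in enumerate(step_probs)
            if token_entropy(probs) > threshold]

# Toy distributions over a 4-token vocabulary (made-up numbers).
steps = [
    [0.97, 0.01, 0.01, 0.01],  # confident step, low entropy
    [0.25, 0.25, 0.25, 0.25],  # uniform step, entropy = ln(4)
    [0.70, 0.20, 0.05, 0.05],  # moderately confident step
]
flagged = uncertain_positions(steps, threshold=1.0)
print(flagged)
```

In a tiered setup, only the flagged positions (or the spans around them) would be candidates for escalation to the larger model.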

arxiv arXiv cs.CL · 2d ago

Hierarchical Attention Transformers for Multi-Turn Jailbreak Detection

A new hierarchical attention model detects multi-turn jailbreaks by encoding turns into compact representations and using a lightweight conversation module to capture dialogue dynamics. On 14,038 conversations, it achieves an F1 score of 0.9394, outperforming Claude Opus 4.7 by 0.07 and halving the false-positive rate. Ablation studies show that combining cross-attention and self-attention in the conversation module lowers false positives by 2.26 percentage points.

arxiv arXiv cs.CL · 2d ago

Study Finds AI Still Fails to Detect Legal Citation Hallucinations

A new study finds that more than 1,000 legal filings contain fabricated citations, and the number is rising annually. A benchmark of five AI models shows improved performance, with GPT-5 achieving 82.8% recall and 60.5% F1 in agentic settings, but all models still struggle with subtle errors and face resource constraints due to limited information access.
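A short worked calculation helps put the recall and F1 figures side by side. Assuming both numbers come from the same evaluation run and that F1 is the standard harmonic mean of precision and recall (neither is confirmed by the summary), the implied precision is roughly 48%, which would mean about half of the flagged citations are false alarms. The helper names below are hypothetical.

```python
def precision_from_f1(f1, recall):
    """Solve F1 = 2PR / (P + R) for precision P."""
    denom = 2 * recall - f1
    if denom <= 0:
        raise ValueError("no valid precision for these F1 and recall values")
    return f1 * recall / denom

def f1_score(precision, recall):
    """Standard F1: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Figures quoted in the summary; treating them as one run is an assumption.
recall, f1 = 0.828, 0.605
precision = precision_from_f1(f1, recall)
print(f"implied precision: {precision:.3f}")
```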

media MarkTechPost · 2d ago

Sakana AI Launches Sakana Fugu: Multi-Agent Orchestration Model

Sakana AI has launched Sakana Fugu, an orchestration model that routes tasks across a swappable pool of frontier LLMs via a single OpenAI-compatible API. Fugu Ultra outperforms individual models on key benchmarks like SWE Bench Pro and GPQA-D, and the system demonstrates superior performance on complex, multi-step tasks such as auto-research, Rubik's Cube solving, and blindfold chess.

media Hugging Face Forums · 3d ago

LLMs as Epistemic Accelerators: The Risk Is Not Only Hallucination

LLMs do not merely hallucinate; they amplify human epistemic overconfidence by turning weak hypotheses into coherent, polished claims before evidence is verified. This creates a risk of premature certainty in research, policy, and other domains, not because models lie, but because they accelerate human tendencies to favor elegant explanations over uncertainty.

media Don't Worry About the Vase · 5d ago

Claude Fable 5 and Mythos 5: Capabilities

Anthropic launched Claude Fable 5, a Mythos-class model for which it claims state-of-the-art performance across software engineering, scientific research, and knowledge work. The model was quickly taken down by the U.S. government after a jailbreak was reported, though Anthropic asserts it is now available again. Fable 5 shows exceptional capabilities and a more nuanced, thoughtful reasoning style than prior models.

arxiv arXiv cs.AI · 6d ago

Calibration Without Comprehension in LLM Vulnerability Detection

CWE-Trace evaluates eight vanilla and 15 LoRA-fine-tuned LLMs on Linux kernel vulnerability detection. Results show data contamination offers no advantage, and fine-tuning only shifts output thresholds without altering decision policies. Despite improved detection scores, LLMs lack reliable security reasoning, with top-1 CWE accuracy below 1.3% and binary detection performance at 52.1%.

arxiv arXiv cs.AI · 6d ago

LedgerAgent: Structured State for Policy-Adherent Tool-Calling Agents

LedgerAgent introduces a structured ledger to maintain task states separately in tool-calling agents. It renders states into prompts and enforces policy constraints before tool execution, reducing policy violations and improving performance across customer-service domains.

arxiv arXiv cs.LG · 6d ago

Probe-and-Refine Tuning Improves Coding Agent Performance

A new method called probe-and-refine tuning uses synthetic bug-fix probes to iteratively improve repository guidance files with single-shot LLM calls, without agent loops or tool use. On SWE-bench Verified, it achieves a 33.0% mean resolve rate, 14.5 percentage points higher than the initial static knowledge base, a gain attributed to improved coverage rather than patch precision. The method enables agents to use larger step budgets effectively, and performance remains stable across models when diagnostic output is sufficient.

arxiv arXiv cs.CL · 6d ago

H-RePlan: Hierarchical Recovery for Cross-Device Agent Systems

H-RePlan introduces a hierarchical replanning framework that separates device-local strategy recovery from global orchestrator replanning. It outperforms existing baselines by achieving higher completion and instruction adherence, with reduced token cost, through scope-aware recovery in multi-device agent systems.

arxiv arXiv cs.AI · 6d ago

Lean as Process-Verified Reward Oracle in RL for Theorem Proving

This work shows that Lean can serve as a symbolic process oracle, providing fine-grained, verified feedback during reinforcement learning. By parsing proof attempts into tactic sequences and using Lean's elaboration to mark sound steps and first failures, the system generates dense, type-theoretic reward signals. Experiments demonstrate tactic-level supervision outperforms outcome-only methods on benchmarks like MiniF2F and ProofNet, highlighting Lean's role as both evaluator and training reward source.

arxiv arXiv cs.AI · 6d ago

Dual-Agent Framework for Cross-Model Verified Translation

A dual-agent framework converts natural-language experiment protocols into executable commands for robotic lab platforms. It uses a Parser Agent and a rule-based mapping engine to translate protocols, with a heterogeneous LLM Validation Agent ensuring accuracy and triggering self-correction. The framework successfully enables end-to-end autonomous execution of microplate-based experiments like the Bradford assay.

arxiv arXiv cs.AI · 6d ago

ScaffoldAgent: Utility-Guided Dynamic Outline Optimization

ScaffoldAgent introduces a utility-guided framework for dynamic outline optimization in open-ended deep research. It models outline evolution through Expansion, Contraction, and Revision operations, guided by a feedback mechanism that evaluates retrieval gain, structural coherence, and generation quality. Experiments show it improves long-form report generation and factual grounding compared to existing agents.

arxiv arXiv cs.AI · 6d ago

MACR: Explicit Conflict Resolution for LLM Inference

MACR introduces a multi-agent reasoning framework to resolve knowledge conflicts in LLM inference by jointly assessing internal and external knowledge. It uses semantic entropy to measure confidence and employs three specialized agents to induce rules, detect conflicts, and resolve inconsistencies across contexts. Empirical results show MACR outperforms state-of-the-art methods and provides interpretable conflict resolutions.

arxiv arXiv cs.AI · 6d ago

Finetuning VLA Models Requires Fewer Layers Than Thought

Vision-Language-Action models show severe layer-wise redundancy despite large parameter counts. A training-free compression method using Centered Kernel Alignment removes twin layers, reducing model depth by up to 50% and enabling 40-50% faster training and up to 30% faster inference without performance loss, validated across simulation and real-world robotic tasks.

arxiv arXiv cs.AI · 6d ago

SoftSkill: Behavioral Compression for Contextual Adaptation

SoftSkill proposes a method to compress natural-language skills into compact latent priors, improving task performance on SearchQA, LiveMath, and DocVQA. It outperforms SkillOpt by 5.2 to 12.5 points on key benchmarks while replacing hundreds to thousands of Markdown tokens with a few virtual tokens.

arxiv arXiv cs.AI · 6d ago

AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

AutoPass uses runtime and compiler evidence to guide LLM-generated optimization decisions, outperforming expert heuristics and classical autotuning methods. It achieves geometric-mean speedups of 1.043x on x86-64 and 1.117x on ARM64 systems without prior training or fine-tuning.
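For readers unfamiliar with the metric, a geometric-mean speedup averages per-benchmark ratios multiplicatively, so a single outlier benchmark influences the result less than it would in an arithmetic mean. The sketch below uses made-up timings and a hypothetical helper name; it is not AutoPass's evaluation code.

```python
from statistics import geometric_mean

def geomean_speedup(baseline_times, tuned_times):
    """Geometric mean of per-benchmark speedups (baseline time / tuned time)."""
    if len(baseline_times) != len(tuned_times):
        raise ValueError("need one tuned time per baseline time")
    ratios = [b / t for b, t in zip(baseline_times, tuned_times)]
    return geometric_mean(ratios)

# Hypothetical runtimes in seconds for two benchmarks.
baseline = [2.0, 4.84]
tuned = [2.0, 4.0]   # per-benchmark speedups of 1.0x and 1.21x
overall = geomean_speedup(baseline, tuned)
print(f"geometric-mean speedup: {overall:.3f}x")
```

Under this definition, a reported 1.043x means the typical benchmark runs about 4.3% faster than the baseline, not that every benchmark does.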