Reasoning models — korshunov.ai

ARIADNE: Agnostic Routing for Inference-time Adapter Selection

ARIADNE enables dynamic, training-free adapter selection at inference time by using centroids from adapter training data embeddings. It selects the most appropriate adapter based on proximity in latent space, without requiring access to adapter internals or additional training, and achieves 89.7% average selection accuracy across 44 NLP tasks.
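The centroid-based routing described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the cosine similarity measure, the toy 3-dimensional embeddings, the adapter names, and the helper functions `centroid`, `cosine`, and `select_adapter` are all assumptions made for the example.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length embedding vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two vectors (assumed proximity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def select_adapter(query, centroids):
    """Return the name of the adapter whose centroid is closest to the query."""
    return max(centroids, key=lambda name: cosine(query, centroids[name]))


# Hypothetical embeddings of each adapter's training data (toy values).
train_embeddings = {
    "sentiment": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "translation": [[0.1, 0.9, 0.2], [0.0, 0.8, 0.3]],
}

# One centroid per adapter, computed once; no training and no access
# to adapter weights is needed at this stage.
centroids = {name: centroid(vs) for name, vs in train_embeddings.items()}

choice = select_adapter([0.7, 0.3, 0.0], centroids)
print(choice)
```

In this sketch, adding a new adapter only requires computing one more centroid from its training embeddings, which is consistent with the training-free framing in the summary.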

arxiv arXiv cs.AI · 8d ago

ProductConsistency: Enhancing Product Identity in Image Editing

The ProductConsistency dataset introduces 87k SFT samples and 869 RL samples to improve product identity preservation in image editing. It includes a benchmark for standardized evaluation and uses a cyclic consistency reward to enforce semantic product identity through caption similarity. Fine-tuning Qwen-Image-Edit-2511 and Flux.1-Kontext-dev shows a 5x reduction in character error rate and improved text rendering and visual quality.

arxiv arXiv cs.AI · 8d ago

Leadership as Coordination Control in Multi-Agent LLM Teams

A study finds that leadership styles in multi-agent LLM teams only improve performance when the initial consensus is unreliable, recoverable, and not self-corrected by undirected interaction. Process-level coordination control adds value only under specific conditions predicted by team science, with no single leadership style outperforming others in accuracy across tasks and models.

arxiv arXiv cs.AI · 8d ago

Equivariant Graph Neural Networks Improve Optical Spectra Prediction

Equivariant graph neural networks outperform existing models in predicting optical spectra for materials screening. The adapted GotenNet achieves superior performance, especially in the 0-8 eV range and for static real permittivity prediction, critical for thin-film optics.

arxiv arXiv cs.AI · 8d ago

Human-AI Coevolution Framework Reveals Social Intelligence Emergence

The Human-AI Coevolution Dynamics Framework (HACD-H) introduces a unified model for long-term human-AI interaction, integrating emotional adaptation, memory, and personality into a self-organizing system. Results show social intelligence emerges through coevolution, with a significant negative correlation between social intelligence and social cognitive energy (r = -0.391, p < 0.001), and progressive energy reduction over time.

arxiv arXiv cs.AI · 8d ago

OrthoReg: Orthogonal Regularization for Hybrid Symbolic-Neural Dynamical Systems

OrthoReg introduces orthogonal regularization to prevent neural components from relearning symbolic structures in hybrid dynamical systems. By directly penalizing overlap between symbolic and neural parts, it enables a complementary decomposition where symbolic models capture expressible physics and neural components handle remaining dynamics. On benchmarks with partial library mismatch, OrthoReg improves symbolic recovery and out-of-distribution performance.

arxiv arXiv cs.AI · 8d ago

AdsMind: Physics-Grounded Multi-Agent System for Adsorption Discovery

AdsMind is a closed-loop multi-agent system that uses machine learning force fields and feedback to correct errors in adsorption configuration searches on catalyst surfaces. It achieves 100% and 98.8% success rates on AA20 and OCD-GMAE62 benchmarks, reduces energy dispersion by 14-fold compared to baselines, and maintains correct adsorption-energy signs in DFT validation, outperforming open-loop LLM agents.

blog Simon Willison · 8d ago

GLM-5.2 is the leading open weights model on the Artificial Analysis Intelligence Index

GLM-5.2, a 753B-parameter text-only model from Z.ai, is now the top open weights model on the Artificial Analysis Intelligence Index, outperforming MiniMax-M3, DeepSeek V4 Pro, and Kimi K2.6. It features a 1 million token context window and ranks second on the Code Arena WebDev leaderboard, despite lacking image input capabilities.

media r/LocalLLaMA · 8d ago

Best models for a 12GB VRAM card

A user with a 12GB VRAM GPU asks for model recommendations for general chatting, roleplaying, and coding. They prioritize uncensored models for chat and roleplaying, and have a Ryzen 5600 CPU and 32GB DDR4 RAM.

media r/LocalLLaMA · 8d ago

I post-trained a model to reliably roll a die

A user trained a language model to roll a die, ensuring each number appears approximately once every six rolls. The post highlights how mainstream LLMs tend to default to saying '4' when asked to roll a die, illustrating a broader issue in reinforcement learning: models often fail to explore effectively and instead follow known patterns.
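One way to check the "once every six rolls" property is a chi-square goodness-of-fit statistic against a uniform die. The sketch below is an assumption about how such a check could be run, not the post's own evaluation; the function name `chi_square_uniform` and both roll samples are made up for illustration.

```python
from collections import Counter


def chi_square_uniform(rolls, sides=6):
    """Chi-square statistic of observed rolls against a fair die."""
    counts = Counter(rolls)
    expected = len(rolls) / sides
    return sum(
        (counts.get(face, 0) - expected) ** 2 / expected
        for face in range(1, sides + 1)
    )


# Hypothetical samples: a model that always answers 4, and a balanced one.
collapsed = [4] * 60
balanced = [1, 2, 3, 4, 5, 6] * 10

collapsed_stat = chi_square_uniform(collapsed)
balanced_stat = chi_square_uniform(balanced)

# With 5 degrees of freedom, values above roughly 11.07 reject
# uniformity at the 5% level.
print(collapsed_stat, balanced_stat)
```

A low statistic only shows that face frequencies are balanced. It does not show that successive rolls are independent, so a model that cycles 1 through 6 in order would also pass.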

media Latent Space · 8d ago

Radical AI Achieves 10x Acceleration in Materials Discovery

Radical AI has accelerated materials discovery by producing and characterizing 1,200 alloys in six months—nearly 10x faster than DARPA/GE MACH's goal of 500 alloys in a year. Their self-driving labs use AI scientists to generate and test hypotheses in closed-loop systems, leading to 300 new materials with 10 exhibiting novel, state-of-the-art properties now being developed for commercial use.

media r/LocalLLaMA · 8d ago

LoopCoder-V2: Two-Loop PLT Model Achieves Best Gain-Cost Trade-Off

LoopCoder-V2 is a 7B instruction-tuned code model based on Parallel Loop Transformer (PLT), trained on 18T tokens of mixed text and code data. The two-loop variant achieves the best gain-cost balance, improving SWE-bench Verified from 43.0 to 64.4, while three or more loops result in regression due to increasing positional mismatch and unstable updates.

media r/LocalLLaMA · 8d ago

GLM-5.2 is a win for local AI

GLM-5.2, with 753B parameters and a 1M-token context window, is now accessible on local hardware through quantization. Its MIT license and extensive training data enable community fine-tuning of smaller models, promising significant improvements for local AI setups.

media r/LocalLLaMA · 8d ago

SIQ-1 Qwen3.6 Achieves Strong Performance in Autoresearch and Benchmarking

The SIQ-1 model, trained using PPO with verifiable reward, outperforms GLM-5.2 and Qwen-350B on parameter-golf tasks, with outputs resembling Opus4.8. It also beats NEX and GPT-5.5 on the bullshit-bench test. The model and GGUF version are available on Hugging Face, along with a ZeroGPU-compatible agent demo.

media r/LocalLLaMA · 8d ago

Is the needle in haystack problem solved?

A user asks whether the 'needle in haystack' benchmark—used to evaluate model performance—is still relevant or has been abandoned. The post reflects on its historical use in model releases and questions if it is now considered outdated or forgotten.

media r/LocalLLaMA · 8d ago

GLM-5.2: Built for Long-Horizon Tasks

GLM-5.2 is a language model designed specifically for long-horizon tasks. It aims to better handle complex, multi-step reasoning and long-term planning by improving its ability to maintain context over extended sequences.

arxiv arXiv cs.LG · 9d ago

LegalHalluLens: Auditing Hallucinations in Legal AI

LegalHalluLens introduces a framework to audit AI hallucinations in legal contexts by analyzing typed hallucination profiles across four claim categories. It reveals a 38-40 point gap between obligation/numeric and temporal claims, and shows two systems with identical 52% hallucination rates can have opposite risk directions. The framework uses a Risk Direction Index and calibrated debate pipelines to reduce fabricated detections by 45%, offering actionable diagnostics for trustworthy legal AI deployment.

arxiv arXiv cs.LG · 9d ago

Recursive Masked Diffusion Models Introduce New Scaling Axis

Recursive Masked Diffusion Models (R-MDMs) introduce recursive depth as a third scaling axis by reapplying a denoising transformer within each diffusion step. This recursion enables iterative output refinement without increasing parameter count, achieving performance comparable to non-recursive models with up to L times more parameters, where L is the number of iterations. R-MDMs also reduce inference compute by partially replacing denoising steps with recursive refinement.

arxiv arXiv cs.LG · 9d ago

Catastrophic Forgetting is Low-Rank: A Function-Space Theory

A function-space theory reveals that catastrophic forgetting in continual adaptation concentrates in a small number of old-task NTK eigenmodes. In frozen-backbone linear-head PEFT-CL, the forgetting vector is exactly predictable up to numerical precision, with a Kronecker scaling rule for the vulnerable rank.

arxiv arXiv cs.LG · 9d ago

Baseline Evaluation of Open-Source LLMs for Multi-Label ATT&CK Classification

A ground-truth dataset of 2,076 human-annotated sentences from 83 complex CTI reports was constructed and mapped to 114 ATT&CK techniques with κ = 0.68 inter-annotator agreement. Seven open-source LLMs ranging from 8B to 236B parameters were evaluated, achieving a maximum micro-averaged F1 score of 0.22. Parameter size showed a statistically significant positive correlation with F1 score, while prompt strategy and temperature did not yield significant improvements, indicating current open-source LLMs are insufficient for production-grade ATT&CK classification.
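For readers unfamiliar with the metric, micro-averaged F1 in a multi-label setting pools true positives, false positives, and false negatives over all sentences before computing precision and recall. The sketch below uses the standard definition; the function name `micro_f1` and the three example sentences with their technique IDs are illustrative assumptions, not data from the paper.

```python
def micro_f1(gold, predicted):
    """Micro-averaged F1 over paired sets of gold and predicted labels."""
    tp = fp = fn = 0
    for gold_labels, pred_labels in zip(gold, predicted):
        tp += len(gold_labels & pred_labels)   # correctly predicted labels
        fp += len(pred_labels - gold_labels)   # predicted but not annotated
        fn += len(gold_labels - pred_labels)   # annotated but missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotations: one set of technique IDs per sentence.
gold = [{"T1059", "T1566"}, {"T1003"}, {"T1486"}]
predicted = [{"T1059"}, {"T1003", "T1055"}, set()]

score = micro_f1(gold, predicted)
print(round(score, 4))
```

Because counts are pooled before averaging, frequent techniques dominate the score. With 114 candidate labels per sentence, even a few spurious predictions per sentence lower precision quickly.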