Reasoning models — korshunov.ai


GLM-5.2 is the leading open weights model on the Artificial Analysis Intelligence Index

GLM-5.2, a 753B-parameter text-only model from Z.ai, is now the top open weights model on the Artificial Analysis Intelligence Index, outperforming MiniMax-M3, DeepSeek V4 Pro, and Kimi K2.6. It features a 1 million token context window and ranks second on the Code Arena WebDev leaderboard, despite lacking image input capabilities.

media r/LocalLLaMA · 8d ago

Best models for a 12GB VRAM card

A user with a 12GB VRAM GPU asks for model recommendations for general chatting, roleplaying, and coding. They prioritize uncensored models for chat and roleplaying, and have a Ryzen 5600 CPU and 32GB DDR4 RAM.

media r/LocalLLaMA · 8d ago

I post-trained a model to reliably roll a die

A user trained a language model to roll a die, ensuring each number appears approximately once every six rolls. The post highlights how mainstream LLMs tend to default to saying '4' when asked to roll a die, illustrating a broader issue in reinforcement learning: models often fail to explore effectively and instead follow known patterns.
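One way to check the "approximately once every six rolls" claim is a chi-square goodness-of-fit test on a batch of sampled rolls. The sketch below is not from the post: the function names are hypothetical, and it assumes the model's answers have already been parsed into integers from 1 to 6.

```python
from collections import Counter

# 95th percentile of the chi-square distribution with 5 degrees of freedom.
CHI2_CRIT_DF5_P05 = 11.07


def chi_square_uniform(rolls, sides=6):
    """Pearson chi-square statistic of `rolls` against a fair die."""
    counts = Counter(rolls)
    expected = len(rolls) / sides
    return sum(
        (counts.get(face, 0) - expected) ** 2 / expected
        for face in range(1, sides + 1)
    )


def looks_uniform(rolls):
    """True if a six-sided fair die is not rejected at the 5% level."""
    return chi_square_uniform(rolls) < CHI2_CRIT_DF5_P05


# A model that always answers 4 is rejected; a perfectly balanced one is not.
always_four = [4] * 60
balanced = [1, 2, 3, 4, 5, 6] * 10
print(looks_uniform(always_four), looks_uniform(balanced))
```

With only a few dozen rolls the test has little power, so a real evaluation would likely sample hundreds of rolls per prompt.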

media Latent Space · 8d ago

Radical AI Achieves 10x Acceleration in Materials Discovery

Radical AI has accelerated materials discovery by producing and characterizing 1,200 alloys in six months—nearly 10x faster than DARPA/GE MACH's goal of 500 alloys in a year. Their self-driving labs use AI scientists to generate and test hypotheses in closed-loop systems, leading to 300 new materials with 10 exhibiting novel, state-of-the-art properties now being developed for commercial use.

media r/LocalLLaMA · 8d ago

LoopCoder-V2: Two-Loop PLT Model Achieves Best Gain-Cost Trade-Off

LoopCoder-V2 is a 7B instruction-tuned code model based on Parallel Loop Transformer (PLT), trained on 18T tokens of mixed text and code data. The two-loop variant achieves the best gain-cost balance, improving SWE-bench Verified from 43.0 to 64.4, while three or more loops result in regression due to increasing positional mismatch and unstable updates.

media r/LocalLLaMA · 8d ago

GLM-5.2 is a win for local AI

GLM-5.2, with 753B parameters and a 1M-token context window, is now accessible on local hardware through quantization. Its MIT license and extensive training data enable community fine-tuning of smaller models, promising significant improvements for local AI setups.
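For a rough sense of what quantization buys at this scale, the weight footprint can be estimated as parameters times bits per weight. This is a back-of-the-envelope sketch only: the bit widths are illustrative assumptions rather than formats named in the post, and the estimate ignores the KV cache, activations, and per-block quantization overhead.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 753_000_000_000  # 753B, as stated in the post

# Illustrative bit widths; real quantized formats add some overhead.
for bits in (16, 8, 4):
    print(f"{bits}-bit: {weight_memory_gb(N_PARAMS, bits):.1f} GB")
```

Even at 4 bits per weight the weights alone come to several hundred gigabytes, so "local hardware" here would presumably mean large unified-memory or multi-GPU machines rather than a single consumer card.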

media r/LocalLLaMA · 8d ago

SIQ-1 Qwen3.6 Achieves Strong Performance in Autoresearch and Benchmarking

The SIQ-1 model, trained using PPO with verifiable reward, outperforms GLM-5.2 and Qwen-350B on parameter-golf tasks, with outputs resembling Opus4.8. It also beats NEX and GPT-5.5 on the bullshit-bench test. The model and GGUF version are available on Hugging Face, along with a ZeroGPU-compatible agent demo.

media r/LocalLLaMA · 8d ago

Is the needle in haystack problem solved?

A user asks whether the 'needle in a haystack' benchmark, which tests whether a model can retrieve a fact planted in a long context, is still relevant or has been abandoned. The post reflects on its historical use in model releases and questions whether it is now considered outdated or forgotten.
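For readers unfamiliar with the test, a minimal sketch of how such a probe is usually built: a "needle" sentence is inserted at a chosen relative depth in filler text, and the model's answer is checked for the planted fact. The function names, filler, and needle below are hypothetical, and the substring check stands in for whatever scoring a real harness uses.

```python
def build_haystack(filler, needle, depth):
    """Insert `needle` among `filler` sentences at relative `depth` in [0, 1]."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    index = round(depth * len(filler))
    return " ".join(filler[:index] + [needle] + filler[index:])


def recalled(answer, expected):
    """Case-insensitive check that the planted fact appears in the answer."""
    return expected.lower() in answer.lower()


filler = ["The sky was grey."] * 4
needle = "The secret code is 7412."
prompt = build_haystack(filler, needle, depth=0.5)
print(prompt)
print(recalled("I believe the code is 7412.", "7412"))
```

A full run would sweep both context length and depth and report recall for each combination.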

media r/LocalLLaMA · 8d ago

GLM-5.2: Built for Long-Horizon Tasks

GLM-5.2 is a language model designed specifically for long-horizon tasks. It aims to better handle complex, multi-step reasoning and long-term planning by improving its ability to maintain context over extended sequences.

arxiv arXiv cs.LG · 8d ago

LegalHalluLens: Auditing Hallucinations in Legal AI

LegalHalluLens introduces a framework to audit AI hallucinations in legal contexts by analyzing typed hallucination profiles across four claim categories. It reveals a 38-40 point gap between obligation/numeric and temporal claims, and shows two systems with identical 52% hallucination rates can have opposite risk directions. The framework uses a Risk Direction Index and calibrated debate pipelines to reduce fabricated detections by 45%, offering actionable diagnostics for trustworthy legal AI deployment.

arxiv arXiv cs.LG · 8d ago

Recursive Masked Diffusion Models Introduce New Scaling Axis

Recursive Masked Diffusion Models (R-MDMs) introduce recursive depth as a third scaling axis by reapplying a denoising transformer within each diffusion step. This recursion enables iterative output refinement without increasing parameter count, achieving performance comparable to non-recursive models with up to L times more parameters, where L is the number of iterations. R-MDMs also reduce inference compute by partially replacing denoising steps with recursive refinement.

arxiv arXiv cs.LG · 8d ago

Catastrophic Forgetting is Low-Rank: A Function-Space Theory

A function-space theory reveals that catastrophic forgetting in continual adaptation concentrates in a small number of old-task NTK eigenmodes. In frozen-backbone linear-head PEFT-CL, the forgetting vector is exactly predictable up to numerical precision, with a Kronecker scaling rule for the vulnerable rank.

arxiv arXiv cs.LG · 8d ago

Baseline Evaluation of Open-Source LLMs for Multi-Label ATT&CK Classification

A ground-truth dataset of 2,076 human-annotated sentences from 83 complex CTI reports was constructed and mapped to 114 ATT&CK techniques with κ = 0.68 inter-annotator agreement. Seven open-source LLMs ranging from 8B to 236B parameters were evaluated, achieving a maximum micro-averaged F1 score of 0.22. Parameter size showed a statistically significant positive correlation with F1 score, while prompt strategy and temperature did not yield significant improvements, indicating current open-source LLMs are insufficient for production-grade ATT&CK classification.
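Micro-averaged F1 for multi-label output pools true positives, false positives, and false negatives across all sentences before computing a single score. The sketch below shows the standard calculation; the function name and the toy labels are illustrative assumptions, not the paper's code or data.

```python
def micro_f1(gold, predicted):
    """Micro-averaged F1 over parallel lists of label sets."""
    tp = fp = fn = 0
    for gold_labels, pred_labels in zip(gold, predicted):
        g, p = set(gold_labels), set(pred_labels)
        tp += len(g & p)  # techniques correctly predicted
        fp += len(p - g)  # predicted but not annotated
        fn += len(g - p)  # annotated but missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy example with three sentences; the technique IDs are illustrative only.
gold = [{"T1059", "T1027"}, {"T1566"}, {"T1003"}]
pred = [{"T1059"}, {"T1566", "T1204"}, set()]
print(round(micro_f1(gold, pred), 3))
```

Because the counts are pooled, frequent techniques dominate the score, which is worth keeping in mind when reading a single micro-F1 figure over 114 labels.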

arxiv arXiv cs.LG · 8d ago

Uncertainty Quantification for Flow-Based Vision-Language-Action Models

We propose a method using velocity-field disagreement to quantify epistemic uncertainty in flow-matching vision-language-action models. This uncertainty estimate enables failure detection during deployment and active fine-tuning via the SAVE framework, which reduces expert demonstrations by at least 22% compared to baselines, with better-calibrated predictions on the LIBERO benchmark.

arxiv arXiv cs.LG · 8d ago

ConTex: Global Counterfactual Generation for Time Series Forecasting

ConTex reformulates counterfactual generation for time series forecasting as a globally consistent intervention problem. It achieves state-of-the-art validity with sparse, interpretable interventions, reduces computational cost by 12-36x, and enables real-time inference in approximately 0.007 seconds.

arxiv arXiv cs.LG · 8d ago

ScaFE: Using LLMs to Extract Clinically Meaningful Scar Features

ScaFE repositions large language models as feature engineers for scar classification, generating executable Python code from clinical criteria to extract interpretable features. The framework achieves superior performance with limited data, preserves privacy by processing images locally, and produces clinically grounded features aligned with established scoring systems like the Vancouver Scar Scale.

arxiv arXiv cs.LG · 8d ago

NoiseTilt: Noise-Tilted Reverse Kernels for Diffusion Reward Alignment

NoiseTilt introduces NTRK, a reward-guided diffusion sampler that injects reward gradients via the noise term without altering the reverse kernel. By using a whitening operator, NTRK safely biases noise toward high reward, preserving sample quality while maintaining strong guidance. On aesthetic generation, NTRK achieves superior reward performance with 25 NFEs, reducing compute by 20× compared to state-of-the-art baselines.

arxiv arXiv cs.LG · 8d ago

Tensor-based Second-order Causal Discovery Algorithm

TSCD uses covariance matrices from observational and interventional data to identify causal structures in linear structural equation models on DAGs. It requires only uncorrelated noise and achieves identifiable causal orders and parameters with logarithmic intervention counts, scaling to hundreds of variables while remaining robust to noise and competitive with existing methods.

arxiv arXiv cs.LG · 8d ago

Compositional Generalization in Language Model Reasoning

A hierarchical latent selection model shows that supervised fine-tuning and reinforcement learning work together to enable compositional generalization in language models. SFT provides raw module materials, while RL identifies and recombines atomic modules from compound traces to solve new problems. Training on compound traces leads to stronger generalization than isolated module training, and an effective protocol is found where SFT ensures module coverage and RL drives exploration of novel compositions.

arxiv arXiv cs.LG · 8d ago

OmniPlan: Adaptive Framework for Timely and Near-Optimal Network Planning

OmniPlan introduces an adaptive framework that converts natural-language user intents into quantifiable preferences using a large language model. It dynamically selects among mixed integer programming, heuristics, and deep reinforcement learning experts to achieve both timeliness and near-optimality in network planning. Evaluations on distributed machine learning workloads show up to 97.8% latency reduction and 11.5% lower resource consumption.