Reasoning models — korshunov.ai

Transformer Models: Architectures, Applications, and Critical Assessment

This review presents a taxonomy of transformer-based language models across domain verticals, covering encoder-only, decoder-only, encoder-decoder, long-context, permutation-based, and generator-discriminator variants. It evaluates post-2023 advancements like instruction tuning and mixture-of-experts scaling, and assesses model deployments in healthcare, finance, legal, education, customer service, creative writing, and scientific work, linking each to specific capabilities. The paper critically analyzes model architectures on four key deployment axes, quantifies parameter count versus energy cost, and examines how alignment methods, data provenance, and benchmark saturation define 'state of the art'.

arXiv cs.CL · 2d ago

Age of LLM: Benchmark for LLM Reasoning and Diplomacy

Age of LLM introduces a turn-based 1v1 benchmark where two LLMs compete on a 13x7 grid under fog of war, full diplomacy, and strict JSON reliability rules. Findings show the nuclear rush dominates, diplomacy is prolific but rarely succeeds, and illegal actions reveal belief-tracking errors, with a weak link between reliability and victory. The corpus is small and unbalanced, and the results offer a preliminary view of LLM reasoning under adversarial uncertainty.

arXiv cs.CL · 2d ago

ExtractConf: Multi-Signal Confidence Engine for LLM Document Extraction

ExtractConf introduces a confidence engine that uses dual LLM readings—field-guided and document-guided—to detect unreliable extractions. It fuses disagreement between calls, LLM uncertainty, and document signals into a classifier, achieving 0.928 ROC AUC on invoices and reducing selective prediction risk by 70%.

arXiv cs.CL · 2d ago
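
The "selective prediction risk" mentioned in the ExtractConf summary can be illustrated with a minimal sketch: keep only the most confident extractions and measure the error rate among them. The function name `selective_risk`, the toy confidence scores, and the correctness labels are assumptions made for illustration; they are not taken from the paper, and the paper's exact risk definition may differ.

```python
import numpy as np

def selective_risk(confidence, correct, coverage):
    """Error rate among the top `coverage` fraction of predictions, ranked by confidence."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    # Number of predictions we agree to answer at this coverage level (at least one).
    n_keep = max(1, int(round(coverage * len(confidence))))
    # Indices of the most confident predictions.
    keep = np.argsort(-confidence, kind="stable")[:n_keep]
    return 1.0 - correct[keep].mean()

# Hypothetical toy data: five extracted fields with confidence scores.
conf = [0.95, 0.90, 0.80, 0.40, 0.30]
ok = [True, True, True, False, True]

full_risk = selective_risk(conf, ok, 1.0)  # answer everything
top_risk = selective_risk(conf, ok, 0.6)   # answer only the 3 most confident
```

A useful confidence score is one for which the risk drops as coverage is reduced, as it does in this toy case.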

Bayesian Control for Coding Agents

Bayesian control improves tool-use decisions in coding agents by modeling uncertainty and dynamically choosing actions. It outperforms fixed-rule orchestrators, especially when verification is costly and critics provide informative but imperfect feedback. The method also produces a more interpretable correctness score than token-probability or raw tool-success metrics.

arXiv cs.CL · 2d ago
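
To make the idea of "informative but imperfect" critic feedback concrete, here is a minimal sketch of a Bayes-rule update on whether a candidate patch is correct, followed by a simple cost-based decision on whether to run an expensive verification. The names `posterior_correct` and `should_verify`, the critic's sensitivity and specificity, and the cost values are all hypothetical; the summary does not describe the paper's actual model or decision rule.

```python
def posterior_correct(prior, critic_says_ok, sensitivity, specificity):
    """P(patch is correct | critic verdict), computed with Bayes' rule.

    sensitivity = P(critic says ok | patch correct)
    specificity = P(critic says not ok | patch wrong)
    """
    if critic_says_ok:
        like_correct = sensitivity
        like_wrong = 1.0 - specificity
    else:
        like_correct = 1.0 - sensitivity
        like_wrong = specificity
    numerator = like_correct * prior
    return numerator / (numerator + like_wrong * (1.0 - prior))

def should_verify(p_correct, verify_cost, failure_cost):
    """Verify only if the expected loss from accepting a wrong patch exceeds the check's cost."""
    return (1.0 - p_correct) * failure_cost > verify_cost

# Hypothetical numbers: uninformed prior, a critic that is right most of the time.
p = posterior_correct(prior=0.5, critic_says_ok=True, sensitivity=0.9, specificity=0.8)
cheap_check = should_verify(p, verify_cost=1.0, failure_cost=10.0)
costly_check = should_verify(p, verify_cost=5.0, failure_cost=10.0)
```

With these assumed numbers the posterior rises to about 0.82, so a cheap check is still worth running while a costly one is not, which is the kind of trade-off a fixed-rule orchestrator cannot express.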

RaDaR: AI Model Improves Rare Disease Diagnosis

RaDaR, a compact reasoning large language model, outperformed other open-source models in rare disease diagnosis. In a randomized trial, RaDaR improved physicians' diagnostic accuracy by 21.44 percentage points over internet search alone.

arXiv cs.CL · 2d ago

Cross-Lingual Exploration for Parametric Knowledge

Cross-lingual prompting strategies improve factual knowledge retrieval across 17 diverse languages. The approach is more compute-efficient than native-language scaling and improves cross-lingual consistency in addition to accuracy.

arXiv cs.CL · 2d ago

Qwen-AgentWorld: Language World Models for General Agents

Qwen-AgentWorld-35B-A3B and Qwen-AgentWorld-397B-A17B are the first language world models that simulate agentic environments across seven domains using long chain-of-thought reasoning. Trained via a three-stage pipeline—CPT, SFT, and RL—these models outperform existing frontier models on AgentWorldBench, a benchmark derived from real-world interactions of five models on nine established tasks.

arXiv cs.CL · 2d ago

Cross-Lingual Proverb Studies Reveal Cultural Meaning Preservation in LLMs

A study evaluates how large language models preserve cultural meaning when generating narratives from equivalent proverbs across 15 languages. Results show semantic consistency in moral lessons, with systematic shifts in narrative agency and structure, and strong convergence across model families. The research highlights that current evaluations may overestimate cultural preservation by focusing only on semantic similarity.

arXiv cs.CL · 2d ago

Privacy-Preserving RAG via Multi-Agent Semantic Rewriting

A multi-agent framework sanitizes retrieved content by removing sensitive identifiers through semantic rewriting, reducing privacy leakage in targeted attacks. It maintains strong contextual fidelity with a BLEU-1 score of 0.122, outperforming SAGE's 0.117, and operates as an asynchronous preprocessing step with no added latency to online inference.

arXiv cs.LG · 2d ago

Memory-Efficient Graph Filtering for Scalable Collaborative Filtering

Mem-GF introduces a memory-efficient graph filtering method that approximates polynomial graph filters using Krylov subspaces, eliminating the need to store the full item similarity graph. It achieves up to 5.74× lower memory usage and 4.38× faster runtime while maintaining superior recommendation accuracy compared to state-of-the-art methods, scaling effectively to datasets with tens of millions of interactions.

arXiv cs.LG · 2d ago
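
The memory saving described in the Mem-GF summary rests on never materializing the item-item similarity graph. A minimal sketch of that general idea, assuming an unnormalized similarity S = RᵀR built from a user-item matrix R: a polynomial filter in S can be applied to a vector using only products with R and Rᵀ. The vectors x, Sx, S²x, ... generated along the way are the ones that span a Krylov subspace. The function name `apply_polynomial_filter` and the coefficients are illustrative assumptions, and this is not the paper's algorithm.

```python
import numpy as np

def apply_polynomial_filter(R, x, coeffs):
    """Compute sum_k coeffs[k] * (R^T R)^k x using only products with R and R^T."""
    result = np.zeros(len(x), dtype=float)
    power = np.asarray(x, dtype=float)  # (R^T R)^0 x
    for k, c in enumerate(coeffs):
        if k > 0:
            # Advance to the next power without forming the items x items matrix.
            power = R.T @ (R @ power)
        result += c * power
    return result

# Hypothetical toy interaction matrix: 6 users, 4 items.
rng = np.random.default_rng(0)
R = rng.random((6, 4))
x = rng.random(4)
coeffs = [0.5, 1.0, -0.25]

filtered = apply_polynomial_filter(R, x, coeffs)

# Dense reference, shown only to check the result; this is what the sketch avoids storing.
S = R.T @ R
dense = 0.5 * x + S @ x - 0.25 * (S @ (S @ x))
```

For a sparse R with many items, the matrix-free version needs memory proportional to the number of interactions rather than to the square of the number of items.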

Distilling Transformers into Recurrent Transformers for Efficient Memory

A new distillation method transfers the observation compression strategy of full-history transformers to recurrent models. By training a teacher model to compress observation histories into fixed-size bottlenecks, the approach aligns the student's memory with the teacher's compression. This enables recurrent transformers to achieve near-full-history performance with linear-time complexity, making them viable for long-horizon robotics applications.

arXiv cs.LG · 2d ago

LIG: Layer-wise Integrated Gradients for Transformer Flow Analysis

LIG extends Integrated Gradients to set-to-set maps in Transformers, enabling token-level attribution within layers. It analyzes module-wise and layer-wide attribution consistency and tracks information flow via separate attention and MLP contributions, using target token embedding and zero or zero-attention outputs as baselines. LIG operates at module boundaries without retraining or custom interpreters, offering a diagnostic XAI tool for Transformer internals.

arXiv cs.LG · 2d ago
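
For readers unfamiliar with the base method that LIG extends, here is a minimal sketch of standard Integrated Gradients on a toy function with a known gradient. It approximates the path integral from a baseline to the input with a midpoint Riemann sum. The toy function, the zero baseline, and the helper name `integrated_gradients` are assumptions chosen for illustration; the layer-wise, set-to-set extension described in the summary is not reproduced here.

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Approximate IG_i = (x_i - b_i) * integral over a in [0, 1] of df/dx_i at b + a (x - b)."""
    x = np.asarray(x, dtype=float)
    baseline = np.asarray(baseline, dtype=float)
    # Midpoints of `steps` equal sub-intervals of [0, 1].
    alphas = (np.arange(steps) + 0.5) / steps
    grads = [grad_fn(baseline + a * (x - baseline)) for a in alphas]
    return (x - baseline) * np.mean(grads, axis=0)

# Toy model: f(v) = sum(v^2), whose gradient is 2v.
def f(v):
    return float(np.sum(v ** 2))

def grad_f(v):
    return 2.0 * v

x = np.array([1.0, -2.0, 3.0])
baseline = np.zeros(3)
attributions = integrated_gradients(grad_f, x, baseline)
```

The attributions sum to f(x) minus f(baseline), which is the completeness property that makes the choice of baseline (such as the zero or zero-attention baselines named in the summary) matter.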

Cost Geometry of Belief in Noisy Inference

A finite-machine inference model uses cost geometry to quantify belief transitions, combining optimal transport with Fisher information. The framework reveals a wall, honesty, and rigidity in belief spaces, with the Gaussian belief achieving maximal hyperbolic curvature. Thermodynamics sets the cost unit, and the geometric floor of precision diverges at certainty, with the value -1/4 representing a key scale.

arXiv cs.AI · 2d ago

Profile-Based Reference in LLM Grounding

The paper argues that reference in large language models is not a fixed link but a profile-based, context-sensitive, and numerically structured phenomenon. It proposes that LLMs ground reference through linguistic traces parameterized via optimization, with referential profiles distributed and activated through context-sensitive computation, supported by mechanistic interpretability findings.

arXiv cs.AI · 2d ago

Linguistic Distance Affects Consensus in Neural Cellular Automata

A study on neural cellular automata shows that linguistic distance slows consensus and induces mild group divergence without full fragmentation. A collective trained under diverse communication protocols remains robust to mismatch, unlike one trained uniformly, and these results are consistent across ring and 2D grid structures, with parallels to human group dynamics.

arXiv cs.AI · 2d ago

Coherence Illusions in Dutch LLMs Revealed

Dutch language models exhibit coherence illusions similar to human readers. Surprisal and attention entropy metrics show that models are misled by context matches, with energy from associative memory identifying discourse coherence mechanisms.

arXiv cs.AI · 2d ago

ARCO: Adaptive Rubric with Co-Evolution for Multi-Step LLM Agents

ARCO introduces a rubric framework that enables step-level credit assignment for multi-step LLM agents. It jointly updates a shared model with generation and scoring heads, allowing the rubric content and scoring function to co-evolve via on-policy data, improving performance and interpretability across benchmarks.

arXiv cs.AI · 2d ago

FastGAN and Transformer Models Improve Aphid Detection in Faba Beans

A study uses FastGAN to generate 10,000 synthetic hyperspectral images of faba bean leaves, preserving real spectral and structural features. Transformer-based models, particularly Vision Transformer, achieve the highest accuracy and F1-scores in classifying healthy versus aphid-infested leaves, outperforming classical CNNs and demonstrating improved disease detection with reduced false negatives.

arXiv cs.AI · 2d ago

Topological Neural Dynamics: Neuron-wise Sequence Modeling

Topological Neural Dynamics (TND) introduces a neuron-wise framework for sequence modeling, where each neuron evolves independently through a directed graph structure. In a single-player Pong behavior cloning task, TND achieves a mean of 17.47 consecutive catches per round, outperforming every baseline model by more than a factor of three.

arXiv cs.AI · 2d ago

NASDAQ: Normalized Observation Space Dynamics-Augmented Q-Learning

NASDAQ addresses low-dimensional observation challenges in reinforcement learning by normalizing observation spaces to balance reconstruction losses across dimensions. The framework combines value learning with short-term value and next observation prediction, achieving competitive or superior performance with less training time compared to existing methods.