Reasoning models — korshunov.ai — ML news


arXiv cs.CL · 7d ago

Bayesian Curriculum Learning on LLM Latent Manifolds

Manifold Bandits introduces Bayesian Manifold Curriculum (BMC), a framework that models problem sampling as a structured bandit problem in the latent space of LLMs. BMC organizes tasks into a hierarchical tree and uses Bayesian learning to guide sampling, revealing tradeoffs between learning signal, task diversity, and evaluation relevance. Prioritizing difficulty alone fails to achieve strong downstream performance, underscoring the need for structure- and type-aware sampling.
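To make the idea of bandit-guided sampling over a task tree concrete, here is a minimal sketch using Beta-Bernoulli Thompson sampling at each level of a two-level tree. This is not the BMC algorithm from the paper: the tree contents, the binary reward standing in for "learning signal", and all function names are assumptions made for illustration.

```python
import random

def thompson_pick(stats, rng):
    """Draw from Beta(successes + 1, failures + 1) per key and return the argmax."""
    draws = {k: rng.betavariate(s + 1, f + 1) for k, (s, f) in stats.items()}
    return max(draws, key=draws.get)

def sample_task(tree, stats, rng):
    """Walk from the root to a leaf, choosing a child by Thompson sampling at each level."""
    node = "root"
    while node in tree:  # only internal nodes appear as keys in the tree
        node = thompson_pick({c: stats[c] for c in tree[node]}, rng)
    return node

def update(parent, stats, leaf, reward):
    """Record a binary reward (0 or 1) on the leaf and on each ancestor below the root."""
    node = leaf
    while node != "root":
        s, f = stats[node]
        stats[node] = (s + reward, f + (1 - reward))
        node = parent[node]

# Hypothetical task tree: categories at the first level, task types at the leaves.
tree = {
    "root": ["algebra", "geometry"],
    "algebra": ["alg_easy", "alg_hard"],
    "geometry": ["geo_easy", "geo_hard"],
}
parent = {child: p for p, children in tree.items() for child in children}
stats = {node: (0, 0) for node in parent}  # (successes, failures) per node

rng = random.Random(0)
leaf = sample_task(tree, stats, rng)
update(parent, stats, leaf, reward=1)
```

Sharing statistics along the path to the root lets evidence from one task type inform its siblings, which is one simple way a tree structure can shape sampling beyond per-task difficulty.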

arXiv cs.CL · 7d ago

AgentFinVQA: Auditable, On-Premise Financial Chart QA

AgentFinVQA introduces a multi-agent pipeline for financial chart question answering that ensures auditability and on-premise deployability without significant accuracy loss. It outperforms baseline models by +7.68 pp using a proprietary backbone and +4.84 pp with open-weights Qwen3.6-27B-FP8, while providing a confidence signal via verifier output that improves human review routing.

arXiv cs.CL · 7d ago

CombEval: Benchmark for Combinatorial Counting in LLMs

CombEval is a dynamic benchmark that generates natural-language counting problems with verified answers using typed Cofola specifications. It evaluates 11 large language models and reveals persistent failures in handling ordered objects, indistinguishable elements, positional constraints, and nested dependencies, with errors rooted in constraint interpretation and counting principles.

arXiv cs.CL · 7d ago

Selective Verification for Budget-Aware Reasoning

Sevra, a serving-layer controller, selectively verifies answers to improve accuracy and reduce token usage. On \mathfive, it achieves 76.3% accuracy with 26.8% fewer post-generation tokens and halved harmful flips, while on \gsm it verifies only 3.0% of examples, boosting accuracy to 94.5% and cutting verification tokens by 91.2%. The study shows that initial solve length and explicit control needs determine optimal verification strategy.

arXiv cs.CL · 7d ago

Semantic Clusters Pre-Train Tsetlin Machine for Interpretability

A new framework pre-trains the Tsetlin Machine using semantic clusters from language models, avoiding embeddings. The method groups text samples into coherent clusters via K-means or Top2Vec, then uses cluster-sample pairs to train a non-negated TM with Type I feedback. Results show superior performance across five datasets, matching BERT-level accuracy while maintaining full interpretability.

arXiv cs.CL · 7d ago

Credence: Semantic Metrics and Convergence Analysis for Claim Decomposition

Credence introduces Semantic-F1, a BGE-large cosine similarity metric that improves claim decomposition accuracy over Jaccard by 15-32 percentage points. It establishes convergence theorems for rule- and LLM-based repair, showing rule-based repair is finitely terminating and monotone, while LLM-based repair requires early-exit guards. Evaluations across social-media, encyclopaedic, and news domains show EPR from 0.94 to 1.00, with rule-repair reducing atomicity violations by 47-100% without fidelity loss.
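As a rough illustration of an embedding-based F1 for claim decomposition, the sketch below counts a predicted claim as matched when its cosine similarity to some reference claim clears a threshold, and likewise for reference claims. The summary only states that Semantic-F1 is built on BGE-large cosine similarity; the matching rule, the 0.8 threshold, and the toy 2-D vectors standing in for BGE-large embeddings are all assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def semantic_f1(pred_vecs, ref_vecs, threshold=0.8):
    """F1 where a claim counts as matched if any claim on the other side is similar enough."""
    if not pred_vecs or not ref_vecs:
        return 0.0
    matched_pred = sum(
        any(cosine(p, r) >= threshold for r in ref_vecs) for p in pred_vecs
    )
    matched_ref = sum(
        any(cosine(p, r) >= threshold for p in pred_vecs) for r in ref_vecs
    )
    precision = matched_pred / len(pred_vecs)
    recall = matched_ref / len(ref_vecs)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy vectors in place of real sentence embeddings (assumption for illustration).
pred = [[1.0, 0.0], [0.0, 1.0]]
ref = [[1.0, 0.1], [-1.0, 0.0], [0.0, -1.0]]
score = semantic_f1(pred, ref)  # precision 1/2, recall 1/3
```

Unlike Jaccard overlap on tokens, a similarity threshold can credit paraphrased claims that share few surface words, which is the kind of gap the reported improvement points at.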

arXiv cs.CL · 7d ago

Control-Window Law for Single-Neuron Steering in Language Models

A new framework defines when single-neuron interventions coherently control model behaviors without output collapse. The control window, based on alignment and norm ratios, predicts behavior triggers and collapse ceilings using forward pass data, with high accuracy on held-out neurons. On refusal, control is typed: coherent bypass occurs without actionable content, while genuine actionable reach appears only in specific cases and at later rollout stages.

arXiv cs.CL · 7d ago

AtomMem: Simple and Effective Memory System for LLM Agents

AtomMem introduces a memory system that stores high-value atomic facts from long-form interactions. It uses hierarchical event structures and temporal profiles to capture coherent episodic contexts and track evolving user attributes, enabling stable and efficient memory evolution. Experiments on the LoCoMo benchmark show AtomMem achieves state-of-the-art performance in reasoning tasks.

arXiv cs.CL · 7d ago

Zero-Shot Agentic LLMs Extract Lung Pathology from Narratives

A zero-shot agentic workflow using open-source LLMs extracts 13 College of American Pathologists synoptic fields from lung resection pathology reports. The best model (GPT-OSS-20B) achieved a Micro-F1 of 0.893, exceeding the baseline in recall and accurately capturing complex pathologic relations without task-specific training.

arXiv cs.CL · 7d ago

LLMs Can Process Non-Readable Text with High Semantic Fidelity

Large language models can maintain 99.5% semantic fidelity when processing compact, non-human-readable text forms called BabelTele, even when the text is reduced to 27.9% of its original length. These model-centric representations show strong performance in cross-model transfer, agent memory, and multi-agent communication, suggesting that human readability is not essential for semantic recovery in LLMs.

arXiv cs.CL · 7d ago

AI-Driven Deliberation: Scaling Inclusivity and Empowering Marginalised Groups

Large Language Models can scale democratic deliberation by scaffolding argumentation and reducing linguistic biases. The chapter uses Systemic-Functional Linguistics to analyze how socio-demographic and communicative variations affect participation, highlighting AI's potential to challenge exclusionary norms while cautioning against over- or under-claiming its capabilities. It calls for ethical safeguards and further research to ensure equitable AI-assisted engagement.

arXiv cs.CL · 7d ago

Lightweight Pronunciation Assessment via Discrete Speech Token Surprisal

A new framework assesses pronunciation using only native speech data, without labeled errors. It uses speech token surprisal and transcript-guided alignment to detect phonotactic deviations, achieving performance close to supervised methods on multiple datasets.
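Token surprisal can be illustrated with a very small model. The sketch below fits an add-one-smoothed bigram model on "native" token sequences and scores each transition of a new sequence as the negative log2 probability under that model. The bigram model is a stand-in chosen for brevity, not the model used in the paper, and the integer token IDs are made up; transcript-guided alignment is not covered here.

```python
import math
from collections import Counter

def train_bigram(sequences):
    """Count bigrams and their left contexts over discrete token sequences."""
    context_counts, bigram_counts, vocab = Counter(), Counter(), set()
    for seq in sequences:
        vocab.update(seq)
        for prev, cur in zip(seq, seq[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return context_counts, bigram_counts, vocab

def surprisal(seq, model):
    """Per-transition surprisal in bits, with add-one smoothing over vocab plus one unseen slot."""
    context_counts, bigram_counts, vocab = model
    v = len(vocab) + 1
    return [
        -math.log2((bigram_counts[(p, c)] + 1) / (context_counts[p] + v))
        for p, c in zip(seq, seq[1:])
    ]

# Hypothetical discrete speech tokens from native speakers only.
native = [[1, 2, 3], [1, 2, 3], [1, 2, 4]]
model = train_bigram(native)

typical = surprisal([1, 2, 3], model)   # transitions seen in native data
atypical = surprisal([1, 3, 2], model)  # transitions never seen in native data
```

Transitions that never occur in the native data receive higher surprisal, so a per-segment surprisal profile can flag deviations without any labeled error examples.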

arXiv cs.CL · 7d ago

GEMS: Geometric Constraints Enable Multi-Semantic Superposition in LLMs

GEMS enables training-free superposition of multiple semantic directions in LLMs by addressing distributional deviation and directional interference through geometric constraints. On GSM8K, it maintains 98% accuracy with three non-mathematical directions, while unconstrained addition drops to 4%; on Wikitext-2, it increases PPL by only 2.2%.
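The two failure modes named above can be pictured with a small geometric sketch: orthogonalizing the steering directions against one another to limit directional interference, and rescaling the steered vector back to its original norm to limit distributional deviation. This is only one plausible reading of "geometric constraints", not the GEMS procedure; the functions, the `strength` parameter, and the 3-D vectors are assumptions.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def orthogonalize(directions):
    """Gram-Schmidt: strip from each direction its components along earlier ones, then normalize."""
    basis = []
    for d in directions:
        r = list(d)
        for b in basis:
            coeff = dot(r, b)
            r = [x - coeff * y for x, y in zip(r, b)]
        n = norm(r)
        if n > 1e-12:  # skip directions already spanned by earlier ones
            basis.append([x / n for x in r])
    return basis

def steer(hidden, directions, strength=1.0):
    """Add orthogonalized directions to a hidden vector, then restore its original norm."""
    out = list(hidden)
    for b in orthogonalize(directions):
        out = [x + strength * y for x, y in zip(out, b)]
    n = norm(out)
    if n == 0:
        return out
    scale = norm(hidden) / n
    return [x * scale for x in out]

# Toy hidden state and two overlapping steering directions (illustrative values).
hidden = [3.0, 4.0, 0.0]
directions = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
basis = orthogonalize(directions)
steered = steer(hidden, directions)
```

Plain vector addition of the same two directions would both double-count their shared component and inflate the norm, which is the contrast the unconstrained baseline in the summary suggests.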

arXiv cs.CL · 7d ago

Training LLMs for Long-Lifecycle Agents via Cross-Domain Generalization

A new framework trains large language models to "Connect the Dots" through reinforcement learning with long rollout sequences. The method includes tailored tasks and environments to foster meta-capability development, showing strong cross-domain generalization and performance in out-of-distribution settings. Implementations are available at https://github.com/agentscope-ai/Trinity-RFT/tree/research/cod/examples/research_cod.

arXiv cs.CL · 7d ago

Segment-Level Mandarin Speech Detection for Cognitive Impairment

A new framework uses an autoencoder with contrastive learning to analyze segment-level Mandarin speech for cognitive impairment detection. It achieves stable, competitive performance across four datasets, with significant improvements in three-class classification, especially when labeled data is limited.

arXiv cs.CL · 7d ago

Information-Theoretic Analysis of Effective Supervision in Latent Chain-of-Thought

This work identifies a dual collapse in latent reasoning: gradient attenuation and representational drift. It proposes Trajectory and Space Supervision, showing that generative reconstruction preserves information capacity better than geometric compression. The Unified Latent Probe measures mutual information between latent trajectories and reasoning steps, revealing an information-performance binding in reasoning accuracy.

arXiv cs.CL · 7d ago

IHUBERT: Persian Pretrained Model with Semantic Deduplication

IHUBERT is a monolingual Persian pretrained language model trained on a 45 GB curated subset of the Sepahr-Danesh collection. It uses vector-based semantic deduplication and a domain-balanced pretraining pipeline to improve corpus quality and reduce redundancy, achieving top performance in extractive question answering and strong results in NER and topic classification, though relation extraction remains a challenge.

arXiv cs.CL · 7d ago

No Self-Preference in Model Revision Under Genuine Authorship

A four-model test on IFEval shows no detectable self-preference in large language models when revising their own text. Authors reject verified-good edits at rates comparable to fresh models, with a gap of -5.1 percentage points (95% CI [-12.9, +2.7]). When authors do reject fixes, 97% of reasons are about detecting flaws, not preference.

arXiv cs.CL · 7d ago

HydraHead: Head-Level Hybrid Attention for Long-Context Performance

HydraHead introduces a head-level hybridization of Full and Linear Attention, leveraging interpretability to select retrieval-critical heads and fuse outputs via a scale-normalized module. Trained on 15B tokens, it achieves over 69% improvement over baseline at 512K context length, outperforming layer-wise hybrids and approaching Qwen3.5's performance on long-context tasks.

arXiv cs.CL · 7d ago

Adaptive LLM Tutoring Improves Engagement and Efficiency

A new adaptive LLM tutoring system uses subject-aware prompting to enhance student engagement. It outperforms static models in simulation and real-world A/B testing, reducing interactions by 3 turns and increasing exercise conversion rates, especially with a stochastic router achieving 28.1%.