Reasoning models — korshunov.ai — ML news

Reasoning models

media Hugging Face Forums · 3d ago

Buddy System: Rust entropy monitor with NER-gated uncertainty for tiered LLM inference

The Buddy System uses a Rust entropy monitor to detect per-token uncertainty in local Gemma 3 4B inference, routing only uncertain tokens to Sonnet via NER-gated span extraction and semantic retrieval. Benchmarks show it achieves 71.4% accuracy at $0.21, outperforming the Anthropic Advisor pattern (62.9% at $0.44) across seven Hugging Face datasets, with a key improvement on SQuAD v2 by routing source passage chunks to the cloud model.
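
The per-token uncertainty check described above can be sketched roughly as follows. This is a minimal illustration in Python, not the project's Rust monitor: the function names, the use of Shannon entropy in bits, and the 1.0-bit threshold are all assumptions made for the example, and the NER gating and retrieval steps are left out.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in bits) of one next-token probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

def flag_uncertain(distributions, threshold_bits=1.0):
    """Return the positions whose entropy exceeds the (hypothetical) threshold.

    In a tiered setup, only these positions would be candidates for
    escalation to the larger model; everything else stays local.
    """
    return [
        i for i, probs in enumerate(distributions)
        if token_entropy(probs) > threshold_bits
    ]

# Toy next-token distributions for three decoding steps (made-up numbers).
steps = [
    [0.97, 0.01, 0.01, 0.01],  # confident prediction, low entropy
    [0.25, 0.25, 0.25, 0.25],  # uniform over four tokens, exactly 2 bits
    [0.70, 0.10, 0.10, 0.10],  # moderately uncertain
]

uncertain_positions = flag_uncertain(steps)
print(uncertain_positions)
```

Raising the threshold sends fewer tokens to the cloud model, lowering cost at the risk of missing uncertain spans; the right value would have to be tuned per model and task.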

arxiv arXiv cs.CL · 3d ago

VADAOrchestra: Neurosymbolic Orchestration of Adaptive Reasoning Workflows

VADAOrchestra introduces a neurosymbolic framework that combines LLM-based workflow orchestration with Datalog+/- symbolic reasoning. It enables adaptive, explainable decision-making by incrementally planning workflows and executing logical inference on demand, offering auditability, scalability, and verifiability in real-world financial scenarios.

arxiv arXiv cs.CL · 3d ago

Variance-Calibrated Modulation for LLM Decoding

VCM addresses the likelihood trap in large language model decoding by introducing dynamic mechanisms to reshape probability distributions. It improves diversity, coherence, and reasoning accuracy in open-ended generation, factual QA, and mathematical reasoning with minimal computational overhead.

arxiv arXiv cs.CL · 3d ago

Gazer: Training-Free Semantic Correction for Autoregressive Visual Models

Gazer introduces a training-free framework that uses multimodal large language model feedback to correct semantic errors in real time during autoregressive visual model generation. By integrating reflective diagnosis and semantic correction stages, Gazer improves compositional accuracy and semantic alignment across multiple models without additional training.

arxiv arXiv cs.CL · 3d ago

Multimodal Chain-of-Thought: Capabilities and Limitations

Multimodal Chain-of-Thought reasoning improves performance in mathematical and scientific reasoning but harms visual grounding and object counting in perception tasks. Models exhibit a 'Look Light, Think Heavy' pattern, where visual reflection diminishes while verbal reflection increases, indicating a persistent bottleneck in visual reasoning.

arxiv arXiv cs.CL · 3d ago

Key Factors in RL for LLM Reasoning Revealed

A theoretical analysis shows that off-policy degree, determined by gradient steps per rollout, significantly impacts importance sampling ratios and token update dominance. The study introduces Adaptive Clip Policy Optimization (ACPO), which adjusts clipping boundaries by token group variance, outperforming DAPO and CISPO on 3B and 7B models across mathematical, QA, and logic reasoning tasks.

arxiv arXiv cs.CL · 3d ago

Context-Aware Distillation and Ablation for Text2DSL

A new Text2DSL system uses context-aware distillation with a structured context of BNF grammar, API specification, and closed identifier vocabulary. Ablation studies show that the vocabulary has the largest impact on semantic quality, while API and BNF significantly improve structural validity, confirming structured context as a critical, not superficial, component.

arxiv arXiv cs.CL · 3d ago

Small Language Models Outperform Frontier LLMs in Relation Extraction

A 300M-parameter SLM fine-tuned on general-domain data achieves 0.83 micro-F1 in general-domain relation extraction, surpassing zero-shot GPT-5.4 and Claude Sonnet 4.6. On literary benchmarks, the SLM reaches 0.92 on the Biographical dataset, outperforming GPT-5.4 and exceeding frontier models on average. These results demonstrate that task-adapted small models can deliver accurate, private, and hardware-efficient performance without relying on large-scale generative models.
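
For readers unfamiliar with the metric, micro-F1 pools true positives, false positives, and false negatives over all documents before computing precision and recall. The sketch below assumes relations are represented as (head, relation, tail) triples and scored by exact match; the paper's actual matching rules and data format are not given in the summary, and the example triples are invented.

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 over documents.

    gold, pred: lists of sets of (head, relation, tail) triples,
    one set per document. Exact-match scoring is an assumption.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # predicted and correct
        fp += len(p - g)   # predicted but not in gold
        fn += len(g - p)   # in gold but missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Invented example: two documents.
gold = [
    {("Ada", "born_in", "London"), ("Ada", "worked_with", "Babbage")},
    {("Turing", "born_in", "London")},
]
pred = [
    {("Ada", "born_in", "London")},
    {("Turing", "born_in", "London"), ("Turing", "worked_with", "Church")},
]

score = micro_f1(gold, pred)  # tp=2, fp=1, fn=1, so P = R = F1 = 2/3
print(round(score, 3))
```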

arxiv arXiv cs.CL · 3d ago

PeerCheck: Improving LLM-Generated Academic Reviews

PeerCheck analyzes differences between LLM and human academic reviews, finding that LLMs focus on theory while humans prioritize methodology and experiments. The framework applies prompting techniques such as Chain-of-Thought (CoT) and retrieval-augmented generation (RAG): CoT significantly improves review quality, while RAG shows an unexpected 'paradox' in which it sometimes reduces quality.

arxiv arXiv cs.CL · 3d ago

Storyline Trees: Hierarchical Representations for Long-Form Narratives

Storyline trees provide hierarchical structures for long-form narratives by segmenting chapters into scenes and inferring narrative layers through top-down and bottom-up procedures. These trees enable adaptive retrieval, improving question-answering performance on three long-context narrative benchmarks compared to baseline methods, with gains confirmed through ablation studies.

arxiv arXiv cs.CL · 3d ago

Using LLM Internal Artifacts to Improve Legal Classification Reliability

This study explores leveraging internal artifacts of large language models to detect incorrect predictions in legal classification tasks. The approach uses features from these artifacts to build classifiers that identify erroneous outputs in bail decision and statute violation predictions. Results show internal artifacts reliably indicate incorrect responses, enhancing the overall reliability of LLM-based legal classification systems.

arxiv arXiv cs.CL · 3d ago

Token-Level Comparison of Transformers and Hybrid Models

A study using Olmo 3 and Olmo Hybrid open weights finds hybrid models outperform transformers on open-class content words and opening delimiters. The gains are less consistent for closed-class function words and closing delimiters, with hybrids excelling in semantic state tasks like pronoun memory and entity tracking, while transformers perform better on bracket-matching tasks. These results suggest recurrent layers enhance state-aware predictions, while attention supports n-gram and syntactic pattern recognition.

arxiv arXiv cs.CL · 3d ago

ViGiL3D++ Enables Diverse Language Generation for 3D Visual Grounding

ViGiL3D++ introduces a scalable, scene-agnostic method that generates diverse visual grounding queries by combining constraint sampling over scene graphs with LLM-based language generation. It outperforms existing models on multiple 3D visual grounding benchmarks and reveals key limitations of current vision-language models.

arxiv arXiv cs.CL · 3d ago

Test-Time Steering Resolves Temporal Fact Conflicts in LLMs

Researchers identify parametric temporal conflicts in language models where outdated facts persist in parameters. They introduce Temporal Attractor Steering (TAS), a test-time method that resolves 29-57% of such conflicts without retraining, maintaining 85-99% accuracy on non-conflict queries and outperforming a baseline on three of four models.

arxiv arXiv cs.CL · 3d ago

Metanym Game: Self-Contained LLM Benchmark for Structural Intelligence

The Metanym Game introduces a contamination-resistant benchmark for LLMs that measures structural intelligence through dynamic, on-the-fly analogy creation. A singular value decomposition of evaluator ratings reveals both generation and judging competence, with factual accuracy correlating strongly to GPQA Diamond at r = 0.92. Judging is a rarer skill: top generators are average judges, while top judges produce mid-tier outputs, and the strongest models earn seats in a council that self-rates and governs the benchmark.

arxiv arXiv cs.CL · 3d ago

LLMs Fall for Deception More Than Humans

A study finds that all 21 evaluated LLMs fall for deceptive traps at a significantly higher rate than human attackers. Despite recognizing traps in their reasoning, LLMs exploit deceptive elements 73.4% of the time, with no correlation between recognition and behavior (Spearman |r| = 0.08, p = 0.73). These results show human-centered deception theories fail to apply to AI attackers, calling for AI-native defense research.
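
As background for the correlation quoted above, Spearman's coefficient is the Pearson correlation of the ranks of two variables. The sketch below is a plain-Python illustration with average ranks for ties; the per-model numbers are invented placeholders, not data from the study, and the code does not compute a p-value.

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; assumes both inputs have non-zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical per-model rates: how often a trap is recognized vs. exploited.
recognition_rate = [0.90, 0.50, 0.70, 0.20, 0.60]
exploit_rate = [0.70, 0.80, 0.60, 0.75, 0.72]
print(spearman(recognition_rate, exploit_rate))
```

A coefficient near zero, as reported in the summary, means that models that recognize traps more often are not measurably less likely to exploit them.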

arxiv arXiv cs.CL · 3d ago

Demographic Metadata Harms DistilBERT Essay Scoring

A study finds that concatenating demographic metadata with text in DistilBERT-based essay scoring models degrades predictive accuracy and increases scoring bias. The experimental model achieved a lower Quadratic Weighted Kappa (0.656 vs. 0.727) and higher validation loss (1.29 vs. 1.25), with score parity dropping from 15 to 12 out of 19 tests.
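
Quadratic Weighted Kappa measures agreement between two sets of ordinal scores, penalizing disagreements by the squared distance between the scores. The following is a minimal sketch of the standard formula; the score range and the example scores are made up and are not taken from the study.

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_score, max_score):
    """QWK between two lists of integer scores in [min_score, max_score].

    Assumes at least two score levels and that the expected disagreement
    is non-zero (i.e. the scores are not all in a single category).
    """
    n = max_score - min_score + 1
    observed = [[0] * n for _ in range(n)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_score][b - min_score] += 1

    total = len(rater_a)
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2
            expected = hist_a[i] * hist_b[j] / total  # chance agreement
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected
    return 1.0 - weighted_observed / weighted_expected

# Made-up essay scores on a 1-4 scale.
human = [1, 2, 3, 4, 2, 3]
model = [1, 2, 3, 4, 3, 3]
print(quadratic_weighted_kappa(human, model, 1, 4))
```

A value of 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate systematic disagreement.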

arxiv arXiv cs.CL · 3d ago

FiLM-Coordinated Dual-Branch Transformer for Language Modeling

A new Transformer architecture introduces separate global and local branches for language modeling, using FiLM to dynamically coordinate them. Experiments show it outperforms single-branch and weakened dual-branch models on small datasets like TinyShakespeare and WikiText-2, with stable results across multiple seeds and channel-selective modulation patterns.
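
FiLM (feature-wise linear modulation) scales and shifts each feature channel using values predicted from a conditioning vector. The sketch below shows only that core operation in plain Python. Which branch conditions which in the paper is not stated in the summary, so treating a global-branch summary as the conditioning input for local-branch features is an assumption, as are all names and weights.

```python
def linear(vec, weights, bias):
    """Dense layer: weights holds one row per output channel."""
    return [
        sum(w * v for w, v in zip(row, vec)) + b
        for row, b in zip(weights, bias)
    ]

def film(features, conditioning, w_gamma, b_gamma, w_beta, b_beta):
    """Apply FiLM: out[c] = gamma[c] * features[c] + beta[c].

    gamma and beta are predicted per channel from the conditioning vector.
    """
    gamma = linear(conditioning, w_gamma, b_gamma)
    beta = linear(conditioning, w_beta, b_beta)
    return [g * x + b for g, x, b in zip(gamma, features, beta)]

# Hypothetical two-channel example.
local_features = [1.0, 2.0]    # assumed output of the local branch
global_summary = [0.5, -1.0]   # assumed conditioning from the global branch

w_gamma = [[1.0, 0.0], [0.0, 1.0]]
b_gamma = [1.0, 1.0]           # gamma = [1.5, 0.0]
w_beta = [[0.0, 0.0], [1.0, 1.0]]
b_beta = [0.0, 0.0]            # beta = [0.0, -0.5]

modulated = film(local_features, global_summary, w_gamma, b_gamma, w_beta, b_beta)
print(modulated)  # [1.5, -0.5]
```

In this toy example the second channel receives a scale of zero, so its original feature is suppressed entirely, which is one simple way a channel-selective modulation pattern can arise.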

arxiv arXiv cs.CL · 3d ago

OTTER: Red-Teaming System for Toxicity-Evading Jailbreak Prompt Optimization

OTTER is a black-box red-teaming framework that bypasses toxicity filters by modifying as few as five tokens. Evaluated on 457 AdvBench prompts across four GPT models, it increases jailbreak success rate from 7.0% to 84.0%, offering the first quantitative analysis of toxicity-bypass relationships and actionable recommendations for classifier hardening.

arxiv arXiv cs.CL · 3d ago

GRAG Framework Decouples Grounding and Personalization in Conversational AI

GRAG decouples content grounding and personalization in conversational models by using generic responses from large language models as a structural scaffold. This approach enables smaller, resource-limited models to achieve up to 47% improvement in ROUGE-2 and 36% in BLEU scores over state-of-the-art methods on diverse benchmarks.