Reasoning models — korshunov.ai — ML news

arXiv cs.CL · 2d ago

Small Language Models Outperform Frontier LLMs in Relation Extraction

A 300M-parameter SLM fine-tuned on general-domain data achieves 0.83 micro-F1 in general-domain relation extraction, surpassing zero-shot GPT-5.4 and Claude Sonnet 4.6. On literary benchmarks, the SLM reaches 0.92 on the Biographical dataset, outperforming GPT-5.4 and exceeding frontier models on average. These results demonstrate that task-adapted small models can deliver accurate, private, and hardware-efficient performance without relying on large-scale generative models.
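For readers unfamiliar with the metric: micro-F1 pools true positives, false positives, and false negatives across all relation types before computing precision and recall. The sketch below is a minimal illustration, not the paper's evaluation code. The relation names, the toy labels, and the choice to ignore a `no_relation` class are all assumptions made for the example.

```python
def micro_f1(gold, pred, null_label="no_relation"):
    """Micro-averaged F1 over relation labels, ignoring the null class.

    Treating `null_label` as "no prediction" is an assumption of this
    sketch; the paper's exact scoring protocol is not given in the summary.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if p != null_label and p == g:
            tp += 1                      # correct non-null relation
        else:
            if p != null_label:
                fp += 1                  # predicted a wrong or spurious relation
            if g != null_label:
                fn += 1                  # missed a gold relation
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data, for illustration only.
gold = ["born_in", "spouse", "no_relation", "works_for", "born_in"]
pred = ["born_in", "spouse", "works_for", "no_relation", "spouse"]
score = micro_f1(gold, pred)
```

Because counts are pooled before averaging, frequent relation types dominate the score, which is why micro-F1 is often reported alongside macro-F1.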

arXiv cs.CL · 2d ago

PeerCheck: Improving LLM-Generated Academic Reviews

PeerCheck analyzes differences between LLM-written and human-written academic reviews, finding that LLMs focus on theory while humans prioritize methodology and experiments. The framework applies prompt-engineering techniques such as Chain-of-Thought (CoT) and retrieval-augmented generation (RAG): CoT significantly improves review quality, while RAG introduces an unexpected 'paradox' in which retrieval sometimes reduces quality.

arXiv cs.CL · 2d ago

Storyline Trees: Hierarchical Representations for Long-Form Narratives

Storyline trees provide hierarchical structures for long-form narratives by segmenting chapters into scenes and inferring narrative layers through top-down and bottom-up procedures. These trees enable adaptive retrieval, improving question-answering performance on three long-context narrative benchmarks compared to baseline methods, with gains confirmed through ablation studies.

arXiv cs.CL · 2d ago

Using LLM Internal Artifacts to Improve Legal Classification Reliability

This study explores leveraging internal artifacts of large language models to detect incorrect predictions in legal classification tasks. The approach uses features from these artifacts to build classifiers that identify erroneous outputs in bail decision and statute violation predictions. Results show internal artifacts reliably indicate incorrect responses, enhancing the overall reliability of LLM-based legal classification systems.

arXiv cs.CL · 2d ago

Token-Level Comparison of Transformers and Hybrid Models

A study using Olmo 3 and Olmo Hybrid open weights finds hybrid models outperform transformers on open-class content words and opening delimiters. The gains are less consistent for closed-class function words and closing delimiters, with hybrids excelling in semantic state tasks like pronoun memory and entity tracking, while transformers perform better on bracket-matching tasks. These results suggest recurrent layers enhance state-aware predictions, while attention supports n-gram and syntactic pattern recognition.

arXiv cs.CL · 2d ago

ViGiL3D++ Enables Diverse Language Generation for 3D Visual Grounding

ViGiL3D++ introduces a scalable, scene-agnostic method that generates diverse visual grounding queries by combining constraint sampling in scene graphs with LLM-based language generation. It outperforms existing models on multiple 3D visual grounding benchmarks and reveals key limitations of current vision-language models.

arXiv cs.CL · 2d ago

Test-Time Steering Resolves Temporal Fact Conflicts in LLMs

Researchers identify parametric temporal conflicts in language models where outdated facts persist in parameters. They introduce Temporal Attractor Steering (TAS), a test-time method that resolves 29-57% of such conflicts without retraining, maintaining 85-99% accuracy on non-conflict queries and outperforming a baseline on three of four models.

arXiv cs.CL · 2d ago

Metanym Game: Self-Contained LLM Benchmark for Structural Intelligence

The Metanym Game introduces a contamination-resistant benchmark for LLMs that measures structural intelligence through dynamic, on-the-fly analogy creation. A singular value decomposition of evaluator ratings reveals both generation and judging competence, with factual accuracy correlating strongly with GPQA Diamond (r = 0.92). Judging is the rarer skill: top generators are average judges, while top judges produce mid-tier outputs. The strongest models earn seats on a council that self-rates and governs the benchmark.

arXiv cs.CL · 2d ago

LLMs Fall for Deception More Than Humans

A study finds that all 21 evaluated LLMs fall for deceptive traps at a significantly higher rate than human attackers. Despite recognizing traps in their reasoning, LLMs exploit deceptive elements 73.4% of the time, with no correlation between recognition and behavior (Spearman |r| = 0.08, p = 0.73). These results show that human-centered deception theories fail to apply to AI attackers, calling for AI-native defense research.
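The "no correlation" claim rests on a Spearman rank correlation between how often a model recognizes a trap and how often it exploits one. A minimal pure-Python sketch of that statistic follows. The function names are hypothetical helpers written for this example, and it does not reproduce the study's data or its p-value computation.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))
```

A value near zero, as reported, means that ranking models by trap recognition says almost nothing about their ranking by trap exploitation.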

arXiv cs.CL · 2d ago

Demographic Metadata Harms DistilBERT Essay Scoring

A study finds that concatenating demographic metadata with text in DistilBERT-based essay scoring models degrades predictive accuracy and increases scoring bias. The experimental model achieved a lower Quadratic Weighted Kappa (0.656 vs. 0.727) and higher validation loss (1.29 vs. 1.25), with score parity dropping from 15 to 12 out of 19 tests.
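Quadratic Weighted Kappa (QWK) measures agreement between two sets of ordinal scores, penalizing disagreements by the square of their distance and correcting for chance agreement. The sketch below shows the standard definition in plain Python. The function name and score range are illustrative, and this is not the study's evaluation code.

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_score, max_score):
    """Standard QWK for integer scores in [min_score, max_score].

    Assumes max_score > min_score and that the scores are not all identical
    (otherwise the expected-disagreement denominator is zero).
    """
    n = max_score - min_score + 1
    # Observed confusion matrix between the two raters.
    observed = [[0] * n for _ in range(n)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_score][b - min_score] += 1

    total = len(rater_a)
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    num = den = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2      # quadratic penalty
            expected = hist_a[i] * hist_b[j] / total  # chance agreement
            num += weight * observed[i][j]
            den += weight * expected
    return 1.0 - num / den


perfect = quadratic_weighted_kappa([1, 2, 3, 4], [1, 2, 3, 4], 1, 4)
reversed_scores = quadratic_weighted_kappa([1, 2, 3, 4], [4, 3, 2, 1], 1, 4)
```

A score of 1 indicates perfect agreement and 0 indicates chance-level agreement, so the reported drop from 0.727 to 0.656 is a loss in agreement with the reference scores.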

arXiv cs.CL · 2d ago

FiLM-Coordinated Dual-Branch Transformer for Language Modeling

A new Transformer architecture introduces separate global and local branches for language modeling, using FiLM to dynamically coordinate them. Experiments show it outperforms single-branch and weakened dual-branch models on small datasets like TinyShakespeare and WikiText-2, with stable results across multiple seeds and channel-selective modulation patterns.
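FiLM (feature-wise linear modulation) scales and shifts each feature channel using parameters predicted from a conditioning signal. The sketch below shows the generic operation only. Which branch conditions which, and all weights and shapes, are assumptions made for illustration and are not taken from the paper.

```python
def linear(vec, weights, bias):
    """Dense layer on plain lists: one output per weight row."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def film(features, condition, w_gamma, b_gamma, w_beta, b_beta):
    """FiLM: out[c] = gamma[c] * features[c] + beta[c].

    gamma and beta are predicted from `condition`, so one branch can
    amplify, suppress, or shift individual channels of the other.
    """
    gamma = linear(condition, w_gamma, b_gamma)
    beta = linear(condition, w_beta, b_beta)
    return [g * f + b for g, f, b in zip(gamma, features, beta)]


# Hypothetical 2-channel example: the global branch modulates local features.
local_features = [1.0, 2.0]
global_summary = [0.5, -1.0]
out = film(
    local_features,
    global_summary,
    w_gamma=[[1.0, 0.0], [0.0, 1.0]], b_gamma=[1.0, 1.0],
    w_beta=[[0.0, 0.0], [0.0, 0.0]], b_beta=[0.1, 0.1],
)
```

In this toy setup the second channel receives a gain of zero and is effectively switched off, which is the kind of per-channel behavior the phrase "channel-selective modulation" suggests.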

arXiv cs.CL · 2d ago

OTTER: Red-Teaming System for Toxicity-Evading Jailbreak Prompt Optimization

OTTER is a black-box red-teaming framework that bypasses toxicity filters by modifying as few as five tokens. Evaluated on 457 AdvBench prompts across four GPT models, it increases jailbreak success rate from 7.0% to 84.0%, offering the first quantitative analysis of toxicity-bypass relationships and actionable recommendations for classifier hardening.

arXiv cs.CL · 2d ago

GRAG Framework Decouples Grounding and Personalization in Conversational AI

GRAG decouples content grounding and personalization in conversational models by using generic responses from large language models as a structural scaffold. This approach enables smaller, resource-limited models to achieve up to 47% improvement in ROUGE-2 and 36% in BLEU scores over state-of-the-art methods on diverse benchmarks.

arXiv cs.CL · 2d ago

Validation-Gated Mechanistic Analysis of Suicidality Detection in LLMs

A validation-gated framework evaluates LLM internal features only after observed behavior, revealing a mid-network feature that causally contributes to suicide detection. This feature is semantic, low-rank, cross-model, and specific to suicidality over general distress, though steering is necessary but not sufficient. The pattern shows smaller models encode suicidality but only larger ones act on it, with evidence limited to English Reddit text.

arXiv cs.CL · 2d ago

Hierarchical Attention Transformers for Multi-Turn Jailbreak Detection

A new hierarchical attention model detects multi-turn jailbreaks by encoding turns into compact representations and using a lightweight conversation module to capture dialogue dynamics. On 14,038 conversations, it achieves an F1 score of 0.9394, outperforming Claude Opus 4.7 by 0.07 and reducing the false-positive rate by half. Ablation studies show that combining cross-attention and self-attention in the conversation module lowers the false-positive rate by 2.26 percentage points.

arXiv cs.CL · 2d ago

LLM-Based Multi-Reference Evaluation for Phrase Break Annotations

LMRE addresses limitations of single-reference evaluation by modeling multiple valid phrasings of speech. It outperforms traditional methods in aligning with human judgment on acceptance and scoring, demonstrating scalability and robustness for Korean speech annotations.

arXiv cs.CL · 2d ago

Answer Engineering: Local Trajectory Editing for Protocol-Constrained Decision Making

Answer Engineering introduces a runtime layer that applies localized rule-based corrections to a model's reasoning trajectory during generation, without retraining. In a clinical benchmark for sudden sensorineural hearing loss, it increased protocol-compliant outcomes from 54.5% to 83.5% and conductive-case adherence from 1.6% to 58.9%.

arXiv cs.CL · 2d ago

Coherence Illusions in Dutch LLMs Revealed

Dutch language models exhibit coherence illusions similar to human readers. Surprisal and attention entropy metrics show that models are misled by context matches, with energy from associative memory identifying discourse coherence mechanisms.

arXiv cs.CL · 2d ago

Multi-Agent Audit Framework for Clinical Mental Health Screening

A multi-agent audit framework improves clinical mental health screening by decomposing reasoning into perception, retrieval, inference, and audit stages. Evaluated on the DAIC-WOZ dataset, it reduces PHQ-8 depression severity prediction error from 5.35 to 5.02 and offers interpretable, verifiable diagnostic rationales.

arXiv cs.CL · 2d ago

Study Finds AI Still Fails to Detect Legal Citation Hallucinations

A new study finds that more than 1,000 legal filings contain fabricated citations, with the number rising annually. Benchmarking five AI models on detecting them shows improved performance, with GPT-5 achieving 82.8% recall and 60.5% F1 in agentic settings, though all models struggle with subtle errors and face resource constraints due to limited information access.
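The gap between 82.8% recall and 60.5% F1 implies a much lower precision. Assuming both figures come from the same evaluation setting and that F1 is the usual harmonic mean of precision and recall (neither is confirmed by the summary), the implied precision can be recovered with a few lines of arithmetic:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def implied_precision(f1, recall):
    """Solve F1 = 2PR / (P + R) for P. Requires 2 * recall > f1."""
    return f1 * recall / (2 * recall - f1)


# Figures quoted in the summary above; the derived precision is an
# inference from them, not a number reported by the study.
recall = 0.828
f1 = 0.605
precision = implied_precision(f1, recall)
```

Under those assumptions the precision comes out at roughly 48%, meaning about half of the citations flagged as hallucinated would be false alarms.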