Reasoning models
arXiv cs.CL · 2d ago

Small Language Models Outperform Frontier LLMs in Relation Extraction

A 300M-parameter SLM fine-tuned on general-domain data achieves 0.83 micro-F1 in general-domain relation extraction, surpassing zero-shot GPT-5.4 and Claude Sonnet 4.6. On literary benchmarks, the SLM reaches 0.92 on the Biographical dataset, outperforming GPT-5.4 and exceeding frontier models on average. These results demonstrate that task-adapted small models can deliver accurate, private, and hardware-efficient performance without relying on large-scale generative models.
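As a rough illustration of the metric quoted above, micro-F1 pools true positives, false positives, and false negatives over all documents before computing a single F1. The sketch below assumes relations are scored as exact-match (head, relation, tail) triples; the paper's actual scoring protocol is not given in the summary, and the relation labels and toy data here are purely illustrative.

```python
from typing import Iterable, Set, Tuple

# Assumed representation: (head entity, relation label, tail entity).
Triple = Tuple[str, str, str]


def micro_f1(gold: Iterable[Set[Triple]], pred: Iterable[Set[Triple]]) -> float:
    """Micro-averaged F1: pool counts over all documents, then compute F1 once."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # predicted triples that are in the gold set
        fp += len(p - g)   # predicted triples that are not in the gold set
        fn += len(g - p)   # gold triples the model missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data, made up for illustration (not from the paper).
gold = [
    {("Ada Lovelace", "born_in", "London"), ("Ada Lovelace", "child_of", "Lord Byron")},
    {("Marie Curie", "born_in", "Warsaw")},
]
pred = [
    {("Ada Lovelace", "born_in", "London")},
    {("Marie Curie", "born_in", "Warsaw"), ("Marie Curie", "born_in", "Paris")},
]

score = micro_f1(gold, pred)
print(f"micro-F1 = {score:.3f}")
```

Because counts are pooled, documents with many relations weigh more heavily than they would under macro-averaging, which averages per-document or per-class scores instead.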

arXiv cs.CL · 2d ago

Using LLM Internal Artifacts to Improve Legal Classification Reliability

This study explores leveraging internal artifacts of large language models to detect incorrect predictions in legal classification tasks. The approach uses features from these artifacts to build classifiers that identify erroneous outputs in bail decision and statute violation predictions. Results show internal artifacts reliably indicate incorrect responses, enhancing the overall reliability of LLM-based legal classification systems.

arXiv cs.CL · 2d ago

Token-Level Comparison of Transformers and Hybrid Models

A study using the open-weight Olmo 3 and Olmo Hybrid models finds that hybrid models outperform transformers on open-class content words and opening delimiters, with less consistent gains on closed-class function words and closing delimiters. Hybrids excel at semantic state tasks such as pronoun memory and entity tracking, while transformers perform better on bracket-matching tasks. These results suggest that recurrent layers enhance state-aware predictions, while attention supports n-gram and syntactic pattern recognition.

arXiv cs.CL · 2d ago

Metanym Game: Self-Contained LLM Benchmark for Structural Intelligence

The Metanym Game introduces a contamination-resistant benchmark for LLMs that measures structural intelligence through dynamic, on-the-fly analogy creation. A singular value decomposition of evaluator ratings reveals both generation and judging competence, with factual accuracy correlating strongly with GPQA Diamond at r = 0.92. Judging is a rarer skill: top generators are average judges, while top judges produce mid-tier outputs, and the strongest models earn seats in a council that self-rates and governs the benchmark.
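For readers unfamiliar with the reported statistic, here is a minimal sketch of how a correlation coefficient such as r = 0.92 is computed. It assumes r is the Pearson coefficient, which the summary does not state explicitly; the function name and the per-model scores are hypothetical and made up for illustration.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of products of deviations from the mean (unnormalized covariance).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-model scores (not from the paper): one entry per model.
benchmark_accuracy = [0.20, 0.40, 0.50, 0.90]
reference_accuracy = [0.30, 0.35, 0.60, 0.80]

r = pearson_r(benchmark_accuracy, reference_accuracy)
print(f"r = {r:.2f}")
```

A value near 1 means models that score well on one benchmark tend to score well on the other; it does not by itself show that the two benchmarks measure the same skill.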

arXiv cs.CL · 2d ago

Validation-Gated Mechanistic Analysis of Suicidality Detection in LLMs

A validation-gated framework examines LLM internal features only after the corresponding behavior has been observed, revealing a mid-network feature that causally contributes to suicide detection. This feature is semantic, low-rank, cross-model, and specific to suicidality over general distress, though steering is necessary but not sufficient. The pattern shows that smaller models encode suicidality but only larger ones act on it; the evidence is limited to English Reddit text.

arXiv cs.CL · 2d ago

Hierarchical Attention Transformers for Multi-Turn Jailbreak Detection

A new hierarchical attention model detects multi-turn jailbreaks by encoding each turn into a compact representation and using a lightweight conversation module to capture dialogue dynamics. On 14,038 conversations, it achieves an F1 score of 0.9394, 0.07 higher than Claude Opus 4.7, while halving the false-positive rate. Ablation studies show that combining cross-attention and self-attention in the conversation module lowers the false-positive rate by 2.26 percentage points.
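The summary above mixes two kinds of false-positive improvement: a relative one (cut by half) and an absolute one (2.26 percentage points). The sketch below shows how F1 and false-positive rate follow from confusion counts and how the two kinds of reduction differ. The class, helper names, and counts are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Binary confusion counts, with 'jailbreak' as the positive class."""
    tp: int
    fp: int
    tn: int
    fn: int

    def f1(self) -> float:
        denom = 2 * self.tp + self.fp + self.fn
        return 2 * self.tp / denom if denom else 0.0

    def fpr(self) -> float:
        # Share of benign conversations wrongly flagged as jailbreaks.
        denom = self.fp + self.tn
        return self.fp / denom if denom else 0.0


def relative_reduction(baseline: float, new: float) -> float:
    """Fraction of the baseline rate removed (0.5 means 'cut by half')."""
    return (baseline - new) / baseline


def percentage_point_reduction(baseline: float, new: float) -> float:
    """Absolute difference between two rates, in percentage points."""
    return (baseline - new) * 100


# Made-up counts for illustration only.
baseline = Confusion(tp=90, fp=20, tn=180, fn=10)
detector = Confusion(tp=92, fp=10, tn=190, fn=8)

rel = relative_reduction(baseline.fpr(), detector.fpr())
pp = percentage_point_reduction(baseline.fpr(), detector.fpr())
print(f"F1 = {detector.f1():.4f}, FPR = {detector.fpr():.2%}")
print(f"relative reduction = {rel:.0%}, absolute reduction = {pp:.2f} pp")
```

In this toy case a drop from 10% to 5% is both a halving and a 5 percentage point reduction; the same relative cut from a lower baseline would be a much smaller absolute change, which is why the two figures are not interchangeable.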