Reasoning models
media Hugging Face Forums · 3d ago

Buddy System: Rust entropy monitor with NER-gated uncertainty for tiered LLM inference

The Buddy System uses a Rust entropy monitor to detect per-token uncertainty in local Gemma 3 4B inference, routing only uncertain tokens to Sonnet via NER-gated span extraction and semantic retrieval. Benchmarks show it achieves 71.4% accuracy at $0.21, outperforming the Anthropic Advisor pattern (62.9% at $0.44) across seven Hugging Face datasets. A key improvement on SQuAD v2 comes from routing source passage chunks to the cloud model.
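
The per-token uncertainty check can be illustrated with a minimal Python sketch. It is not the project's Rust implementation: the summary does not say how entropy is computed or what threshold is used, so Shannon entropy in bits, the `threshold_bits` value, and the function names below are all assumptions made for illustration. The NER gate and semantic retrieval steps are omitted.

```python
import math


def token_entropy(probs):
    """Shannon entropy, in bits, of one next-token probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def flag_uncertain(distributions, threshold_bits=1.5):
    """Return the positions whose entropy exceeds the threshold.

    threshold_bits is a hypothetical cutoff, not a value from the source.
    """
    return [
        i
        for i, probs in enumerate(distributions)
        if token_entropy(probs) > threshold_bits
    ]


# Toy next-token distributions for three decoding steps (made-up numbers).
steps = [
    [0.97, 0.01, 0.01, 0.01],  # confident prediction
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain over four candidates
    [0.70, 0.20, 0.05, 0.05],  # mildly uncertain
]

uncertain_positions = flag_uncertain(steps, threshold_bits=1.5)
print(uncertain_positions)
```

In a tiered setup of the kind described, only the flagged positions would be candidates for escalation to the larger model, so the cutoff trades cost against accuracy.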

arxiv arXiv cs.CL · 3d ago

Small Language Models Outperform Frontier LLMs in Relation Extraction

A 300M-parameter SLM fine-tuned on general-domain data achieves 0.83 micro-F1 in general-domain relation extraction, surpassing zero-shot GPT-5.4 and Claude Sonnet 4.6. On literary benchmarks, the SLM reaches 0.92 on the Biographical dataset, outperforming GPT-5.4 and exceeding frontier models on average. These results demonstrate that task-adapted small models can deliver accurate, private, and hardware-efficient performance without relying on large-scale generative models.
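
For readers unfamiliar with the metric, micro-F1 pools true positives, false positives, and false negatives over all examples before computing precision and recall. The sketch below shows that calculation on invented toy triples. The `micro_f1` helper, the triple format, and the data are assumptions for illustration and do not reproduce the paper's evaluation protocol.

```python
def micro_f1(gold, predicted):
    """Micro-averaged F1 over per-example sets of (head, relation, tail) triples."""
    tp = fp = fn = 0
    for gold_triples, pred_triples in zip(gold, predicted):
        gold_set, pred_set = set(gold_triples), set(pred_triples)
        tp += len(gold_set & pred_set)   # correctly extracted triples
        fp += len(pred_set - gold_set)   # extracted but not in gold
        fn += len(gold_set - pred_set)   # in gold but missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical gold and predicted triples for two sentences.
gold = [
    {("Ada", "born_in", "London")},
    {("Ada", "wrote", "Notes"), ("Ada", "worked_with", "Babbage")},
]
predicted = [
    {("Ada", "born_in", "London")},
    {("Ada", "wrote", "Notes"), ("Ada", "married", "Babbage")},
]

score = micro_f1(gold, predicted)
print(round(score, 3))
```

Because counts are pooled before averaging, examples with many relations weigh more heavily than they would under macro-F1.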

arxiv arXiv cs.CL · 3d ago

Using LLM Internal Artifacts to Improve Legal Classification Reliability

This study explores leveraging internal artifacts of large language models to detect incorrect predictions in legal classification tasks. The approach uses features from these artifacts to build classifiers that identify erroneous outputs in bail decision and statute violation predictions. Results show internal artifacts reliably indicate incorrect responses, enhancing the overall reliability of LLM-based legal classification systems.

arxiv arXiv cs.CL · 3d ago

Token-Level Comparison of Transformers and Hybrid Models

A study using Olmo 3 and Olmo Hybrid open weights finds hybrid models outperform transformers on open-class content words and opening delimiters. The gains are less consistent for closed-class function words and closing delimiters, with hybrids excelling in semantic state tasks like pronoun memory and entity tracking, while transformers perform better on bracket-matching tasks. These results suggest recurrent layers enhance state-aware predictions, while attention supports n-gram and syntactic pattern recognition.

arxiv arXiv cs.CL · 3d ago

Metanym Game: Self-Contained LLM Benchmark for Structural Intelligence

The Metanym Game introduces a contamination-resistant benchmark for LLMs that measures structural intelligence through dynamic, on-the-fly analogy creation. A singular value decomposition of evaluator ratings reveals both generation and judging competence, with factual accuracy correlating strongly with GPQA Diamond (r = 0.92). Judging is a rarer skill: top generators are average judges, while top judges produce mid-tier outputs. The strongest models earn seats in a council that self-rates and governs the benchmark.