Reasoning models
arXiv cs.CL · 2d ago

Benchmark Evaluation of Small Language Models for Arabic NLP

A benchmark of 240 Arabic test items spanning eight domains and ten skills evaluates twelve small language models in zero-shot settings. Gemma 3 (12B) achieved the highest overall score (4.548/5), followed by Aya and C4AI Command Arabic; performance tracked Arabic alignment and instruction-following more closely than model size. Common failure modes include prompt leakage, hallucination, and weak task adherence.

arXiv cs.CL · 2d ago

Two-Stage Alignment Improves Math Tutoring Pedagogy

A two-stage alignment pipeline improves large language models' pedagogical performance in math mistake remediation. The approach combines supervised fine-tuning with Direct Preference Optimization (DPO) on synthetic data targeting scaffolding and factuality, and it outperforms base and existing tutoring models in both accuracy and teaching quality. In human evaluations the model is competitive with a proprietary baseline while offering greater openness and reproducibility.
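
For readers unfamiliar with the second stage, the following is a minimal sketch of the standard per-pair DPO loss, not the paper's training code. The function name, argument names, and the default `beta=0.1` are illustrative assumptions; the inputs are summed log-probabilities of a preferred ("chosen") and dispreferred ("rejected") response under the policy being trained and under a frozen reference model (typically the stage-one SFT checkpoint).

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for a single preference pair (illustrative sketch).

    All four inputs are sequence log-probabilities. `beta` scales how strongly
    the policy is pushed away from the reference; 0.1 is an assumed default,
    not a value taken from the paper.
    """
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference model.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # loss = -log(sigmoid(logits)), computed in a numerically stable way.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# Hypothetical log-probabilities: the policy has shifted toward the chosen reply.
untrained = dpo_loss(-2.0, -2.0, -2.0, -2.0)
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

When the policy equals the reference, the loss is log 2; it falls as the policy assigns relatively more probability to the chosen response than the reference does.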

arXiv cs.CL · 2d ago

MedHal-Loc Benchmark Tests Localization Faithfulness in Medical Hallucination Detectors

MedHal-Loc introduces a benchmark to evaluate whether medical hallucination detectors accurately localize errors. It finds that while some architectures localize well above chance, a knowledge-graph pipeline performs no better than random due to poor entity extraction, despite strong detection performance. The results show that detection capability does not guarantee faithful localization, challenging assumptions about architectural explainability.
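
The gap between detection and localization can be illustrated with a toy calculation. This is a sketch under stated assumptions, not MedHal-Loc's actual metric or data: the `Example` record, the sentence-level granularity, the uniform-guess chance baseline, and the five toy items are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Example:
    n_sentences: int               # sentences in the generated text
    gold_error_idx: Optional[int]  # sentence containing the error; None if clean
    pred_error_idx: Optional[int]  # sentence the detector blames; None if it says clean


def _true_positives(examples: List[Example]) -> List[Example]:
    # Items that contain an error AND were flagged by the detector.
    return [e for e in examples
            if e.gold_error_idx is not None and e.pred_error_idx is not None]


def detection_accuracy(examples: List[Example]) -> float:
    # Did the detector get the document-level yes/no decision right?
    hits = sum((e.gold_error_idx is None) == (e.pred_error_idx is None)
               for e in examples)
    return hits / len(examples)


def localization_accuracy(examples: List[Example]) -> float:
    # Among correctly flagged items, did it point at the right sentence?
    tp = _true_positives(examples)
    return sum(e.pred_error_idx == e.gold_error_idx for e in tp) / len(tp) if tp else 0.0


def chance_localization(examples: List[Example]) -> float:
    # Expected accuracy of picking one sentence uniformly at random.
    tp = _true_positives(examples)
    return sum(1 / e.n_sentences for e in tp) / len(tp) if tp else 0.0


# Hypothetical detector: every yes/no decision is right, but it mostly blames the wrong sentence.
toy = [
    Example(4, 1, 3),
    Example(4, 0, 2),
    Example(4, 2, 2),
    Example(4, 3, 0),
    Example(4, None, None),
]
det = detection_accuracy(toy)
loc = localization_accuracy(toy)
chance = chance_localization(toy)
```

In this toy set the detector classifies every item correctly yet localizes no better than a uniform random guess, which is the pattern the summary describes for the knowledge-graph pipeline.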

media Hugging Face Forums · 3d ago

Capability Is Not in the Weights: Empirical Negative Result on MLP Weight Projection

An empirical study found that projecting MLP weights from one transformer model into another fails to transfer semantic capability. Every tested variant performed worse than the unmodified host model, indicating a structural limitation in weight projection. The results challenge public claims about model capabilities based on benchmarks, showing such claims do not reflect actual internal weight geometry.