media r/LocalLLaMA · 9d ago · open_models

HalBench Tests 29 Open Source Models on Sycophancy and Hallucination


HalBench evaluates 29 open-source LLMs on a custom benchmark for sycophancy and hallucination. Qwen 3.6 and Gemma 4 outperform larger models, with Qwen 3.6 reaching 36.6% pushback, a higher rate than GPT-5.4 and Gemini 3.1 Pro. Model size does not correlate with honest responses in these results, which suggests that architecture and training data matter more than parameter count.

Importance 3/3 · Beats a top-lab benchmark · r/LocalLLaMA · Alibaba (Qwen) · DeepSeek · Mistral AI · Evaluation & benchmarks · Reasoning models · Safety & alignment
