The benchmark gap between closed and open models may be smaller than assumed because of hidden system enhancements

The article argues that the performance gap between closed and open models is likely overstated because benchmarks compare an open model's raw inference against a closed provider's full product stack. Closed providers can significantly boost results through backend techniques such as RAG, prompt preprocessing, and specialized expert models, without disclosing any of these additions.

Benchmarks often compare GLM's raw inference with Claude's entire product suite, creating an unfair comparison.
Providers may use hidden internal tool calls, context-dependent system prompts, or "clown-car MoE" architectures to improve output.
Anthropic already redacts reasoning traces and restricts access to full conversations, obscuring these enhancements.
It is possible that, once isolated from its surrounding system, no single closed model's raw inference output actually beats that of open models.
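
To make the comparison problem concrete, here is a minimal toy sketch in Python. Everything in it is hypothetical and invented for illustration: `raw_model`, `retrieve`, `product_endpoint`, the `KNOWLEDGE` store, and the one-question dataset do not describe any real provider's backend. It only shows that one and the same underlying model can score differently depending on what is silently wrapped around it.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for a bare model: it can only use what is in its prompt.
def raw_model(prompt: str) -> str:
    # Toy behavior: answer correctly only if the fact appears in the prompt.
    if "capital of Freedonia is Fredville" in prompt:
        return "Fredville"
    return "unknown"

# Hypothetical hidden backend store that the benchmark runner never sees.
KNOWLEDGE = {"Freedonia": "The capital of Freedonia is Fredville."}

def retrieve(question: str) -> List[str]:
    # Naive keyword lookup standing in for a RAG step.
    return [doc for key, doc in KNOWLEDGE.items() if key in question]

def product_endpoint(question: str) -> str:
    # Same model, but with a hidden system prompt and retrieved context added.
    system = "Answer using the context."
    context = "\n".join(retrieve(question))
    return raw_model(f"{system}\n{context}\n{question}")

def accuracy(answer_fn: Callable[[str], str],
             dataset: List[Tuple[str, str]]) -> float:
    # Fraction of questions answered exactly right.
    correct = sum(answer_fn(q) == a for q, a in dataset)
    return correct / len(dataset)

# Made-up benchmark with a single question.
DATASET = [("What is the capital of Freedonia?", "Fredville")]

if __name__ == "__main__":
    print("raw model:       ", accuracy(raw_model, DATASET))
    print("product endpoint:", accuracy(product_endpoint, DATASET))
```

In this toy setup the bare model misses the question while the wrapped endpoint gets it right, even though both call the same `raw_model`. A benchmark that can only observe the endpoint's output has no way to tell how much of the score comes from the model and how much from the wrapper.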

The author suggests that without visibility into the backend processing, it is impossible to accurately assess the true capabilities of the underlying models.