media r/LocalLLaMA · 1 hour ago · source: 4 days ago · open_models

SWE-rebench leaderboard adds GLM-5.2, Qwen3.6, and Gemma 4 and refines its interface


The SWE-rebench leaderboard has been updated with new model entries and a redesigned user interface that makes results easier to compare.

Claude Opus 4.8 xhigh leads with a 56.5% resolved rate using 2.48M tokens.
GLM-5.2 follows at 51.1% using 2.62M tokens.
Gemini 3.5 Flash scores 49.5% using 1.85M tokens.
MiniMax M3 reaches 45.6% using 6.89M tokens.
DeepSeek-V4 Pro reaches 42.7% using 2.25M tokens.
MiMo V2.5 Pro scores 42.4% using 2.59M tokens.
DeepSeek-V4 Flash reaches 38.4% using 3.00M tokens.
Qwen3.6-27B reaches 36.5% using 1.88M tokens.
Qwen3.6-35B-A3B scores 33.8% using 2.23M tokens.
Gemma 4 31B scores 16.5% using 2.24M tokens.

The update highlights local and self-hosted models and notes that Qwen3.6-27B performs especially well for its parameter count.
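Since each entry reports both a resolved rate and a token count, one rough way to compare the models is to divide the first by the second. The sketch below does that with the figures listed above. The "points per million tokens" ratio and the names `RESULTS`, `points_per_million_tokens`, and `rank_by_efficiency` are illustrative assumptions, not something the leaderboard itself defines or reports.

```python
# Each row: (model, resolved rate in %, tokens used in millions),
# copied from the leaderboard summary above.
RESULTS = [
    ("Claude Opus 4.8 xhigh", 56.5, 2.48),
    ("GLM-5.2", 51.1, 2.62),
    ("Gemini 3.5 Flash", 49.5, 1.85),
    ("MiniMax M3", 45.6, 6.89),
    ("DeepSeek-V4 Pro", 42.7, 2.25),
    ("MiMo V2.5 Pro", 42.4, 2.59),
    ("DeepSeek-V4 Flash", 38.4, 3.00),
    ("Qwen3.6-27B", 36.5, 1.88),
    ("Qwen3.6-35B-A3B", 33.8, 2.23),
    ("Gemma 4 31B", 16.5, 2.24),
]


def points_per_million_tokens(resolved_pct: float, tokens_millions: float) -> float:
    """Hypothetical efficiency ratio: resolved-rate points per 1M tokens used."""
    return resolved_pct / tokens_millions


def rank_by_efficiency(results):
    """Return the rows sorted from most to least token-efficient."""
    return sorted(
        results,
        key=lambda row: points_per_million_tokens(row[1], row[2]),
        reverse=True,
    )


if __name__ == "__main__":
    for model, pct, tokens in rank_by_efficiency(RESULTS):
        ratio = points_per_million_tokens(pct, tokens)
        print(f"{model:<24}{pct:>6.1f}%{tokens:>7.2f}M{ratio:>8.2f}")
```

Under this ratio Gemini 3.5 Flash ranks first and MiniMax M3 last, which differs from the ordering by resolved rate alone. Treat it as a reading aid only: the summary does not say how the token counts were measured, so they may not be directly comparable across models.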

Importance 1/3 · r/LocalLLaMA · Benchmark results · Code generation

Benchmarks

Benchmark	Model	Score
SWE-rebench	Claude Opus 4.8 xhigh	56.5%
SWE-rebench	GLM-5.2	51.1%
SWE-rebench	Gemini 3.5 Flash	49.5%
SWE-rebench	MiniMax M3	45.6%
SWE-rebench	DeepSeek-V4 Pro	42.7%
SWE-rebench	MiMo V2.5 Pro	42.4%
SWE-rebench	DeepSeek-V4 Flash	38.4%
SWE-rebench	Qwen3.6-27B	36.5%
SWE-rebench	Qwen3.6-35B-A3B	33.8%
SWE-rebench	Gemma 4 31B	16.5%