SWE-rebench leaderboard adds GLM-5.2, Qwen3.6, Gemma 4 and improves the interface

The SWE-rebench leaderboard has been updated with new model entries and a redesigned user interface that makes results easier to compare.

Claude Opus 4.8 xhigh leads with a 56.5% resolve rate using 2.48M tokens.
GLM-5.2 reaches 51.1% with 2.62M tokens.
Gemini 3.5 Flash scores 49.5% using 1.85M tokens.
MiniMax M3 reaches 45.6% with 6.89M tokens.
DeepSeek-V4 Pro gets 42.7% using 2.25M tokens.
MiMo V2.5 Pro scores 42.4% with 2.59M tokens.
DeepSeek-V4 Flash reaches 38.4% using 3.00M tokens.
Qwen3.6-27B reaches 36.5% with 1.88M tokens.
Qwen3.6-35B-A3B scores 33.8% using 2.23M tokens.
Gemma 4 31B reaches 16.5% with 2.24M tokens.
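
Because each entry pairs a resolve rate with a token count, one rough way to compare them is resolve-rate points per million tokens. The sketch below is illustrative only: the ratio is not a metric reported by the leaderboard, the name `points_per_million_tokens` is hypothetical, and it assumes the token figures are counted the same way for every model, which this summary does not state.

```python
# (model, resolve rate in %, tokens in millions), copied from the list above.
results = [
    ("Claude Opus 4.8 xhigh", 56.5, 2.48),
    ("GLM-5.2", 51.1, 2.62),
    ("Gemini 3.5 Flash", 49.5, 1.85),
    ("MiniMax M3", 45.6, 6.89),
    ("DeepSeek-V4 Pro", 42.7, 2.25),
    ("MiMo V2.5 Pro", 42.4, 2.59),
    ("DeepSeek-V4 Flash", 38.4, 3.00),
    ("Qwen3.6-27B", 36.5, 1.88),
    ("Qwen3.6-35B-A3B", 33.8, 2.23),
    ("Gemma 4 31B", 16.5, 2.24),
]


def points_per_million_tokens(rate: float, tokens_m: float) -> float:
    """Hypothetical efficiency ratio: resolve-rate points per million tokens."""
    return rate / tokens_m


# Sort from most to least token-efficient under this assumed ratio.
ranked = sorted(
    results,
    key=lambda r: points_per_million_tokens(r[1], r[2]),
    reverse=True,
)

for name, rate, tokens_m in ranked:
    ratio = points_per_million_tokens(rate, tokens_m)
    print(f"{name:<24}{rate:>6.1f}%{tokens_m:>7.2f}M{ratio:>8.2f}")
```

A ratio like this rewards low token use as much as high accuracy, so it should be read alongside the raw resolve rate rather than in place of it.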

The update highlights local and self-hosted models, noting Qwen3.6-27B as particularly strong for its size.

Benchmarks

Benchmark	Model	Score
SWE-rebench	Claude Opus 4.8 xhigh	56.5%
SWE-rebench	GLM-5.2	51.1%
SWE-rebench	Gemini 3.5 Flash	49.5%
SWE-rebench	MiniMax M3	45.6%
SWE-rebench	DeepSeek-V4 Pro	42.7%
SWE-rebench	MiMo V2.5 Pro	42.4%
SWE-rebench	DeepSeek-V4 Flash	38.4%
SWE-rebench	Qwen3.6-27B	36.5%
SWE-rebench	Qwen3.6-35B-A3B	33.8%
SWE-rebench	Gemma 4 31B	16.5%
