Benchmark · agentic

SWE-bench

The original 2,294-issue suite; superseded by SWE-bench Verified for headline results.

12 results · 9 models

Qwable-v1 · ContextRL · offline preference-based trajectory evaluation · Qwen3-4B · Qwen3-8B · Qwen3-30B-A3B · Qwen2.5-0.5B · GPT-5.4 · Claude Sonnet 4.6

Timeline

2026-06-23 · Qwen2.5-0.5B · 0.83 pts · Small Language Models Outperform Frontier LLMs in Relation Extraction
2026-06-23 · GPT-5.4 · 0.69 pts · Small Language Models Outperform Frontier LLMs in Relation Extraction
2026-06-23 · Claude Sonnet 4.6 · 0.66 pts · Small Language Models Outperform Frontier LLMs in Relation Extraction
2026-06-18 · Qwen3-4B · 7.2 pts · Data Recipe Boosts Long-Context Reasoning in LLMs
2026-06-18 · Qwen3-8B · 3.2 pts · Data Recipe Boosts Long-Context Reasoning in LLMs
2026-06-18 · Qwen3-30B-A3B · 6.4 pts · Data Recipe Boosts Long-Context Reasoning in LLMs
2026-06-18 · Qwen3-4B · 7.2 pts · Data Recipe Boosts Long-Context Reasoning in LLMs
2026-06-18 · Qwen3-8B · 3.2 pts · Data Recipe Boosts Long-Context Reasoning in LLMs
2026-06-18 · Qwen3-30B-A3B · 6.4 pts · Data Recipe Boosts Long-Context Reasoning in LLMs
2026-06-17 · offline preference-based trajectory evaluation · 75.0% · Preference-Based Trajectory Evaluation for Agentic Systems
2026-06-16 · ContextRL · 2.2% · ContextRL: Context-Aware RL for LLMs
2026-06-16 · Qwable-v1 · 80.3% · Qwable-v1 Released as Distillation of Claude Fable-5