Benchmark · agentic

SWE-bench Verified

Human-verified subset of SWE-bench; resolving GitHub issues end-to-end.

50 results · 42 models
[Chart: scores by date, 2026-06-17 to 2026-06-24, one series per model]
Timeline
  1. 2026-06-24 MahaBERT-v2 88.67% L3Cube-MahaPOS: Marathi POS Tagging Dataset and BERT Models
  2. 2026-06-24 SHERLOC 81.27% SHERLOC: Structured Diagnostic Localization for Code Repair Agents
  3. 2026-06-24 DeBERTa 90.0pts AutoSpecNER: Fine-Grained NER Dataset for Vehicle Specifications
  4. 2026-06-24 rule-based 43.0pts AutoSpecNER: Fine-Grained NER Dataset for Vehicle Specifications
  5. 2026-06-24 strongest large language model 77.8pts AutoSpecNER: Fine-Grained NER Dataset for Vehicle Specifications
  6. 2026-06-23 Qwen 3.5 35.0% LLMs Benchmarked for Web Vulnerability Detection
  7. 2026-06-23 Claude Opus 4.6 63.0% LLMs Benchmarked for Web Vulnerability Detection
  8. 2026-06-23 MiniMax M2.5 48.0% LLMs Benchmarked for Web Vulnerability Detection
  9. 2026-06-23 three machines, two small language models, and three retrieval/in-context prompting approaches 73.1% The Token Tax of Epistemic Accuracy in Document-Grounded AI
  10. 2026-06-21 Qwen3.6 27B 79.6pts Updated Vision Model Benchmark Results and Recommendations
  11. 2026-06-21 GLM-5.2 0.0% GLM-5.2 Beats Gemini and GPT-5.4 in Coding but Is Inefficient
  12. 2026-06-20 GLM 5.2 98.0% GLM 5.2 Achieves 98% Max Intelligence with Less Than Half Tokens
  13. 2026-06-19 DeepSeek-R1 52.1% Calibration Without Comprehension in LLM Vulnerability Detection
  14. 2026-06-19 router 0.694 Adaptive LLM Tutoring Improves Engagement and Efficiency
  15. 2026-06-19 baseline 0.647 Adaptive LLM Tutoring Improves Engagement and Efficiency
  16. 2026-06-19 router 0.694 Adaptive LLM Tutoring Improves Engagement and Efficiency
  17. 2026-06-19 baseline 0.647 Adaptive LLM Tutoring Improves Engagement and Efficiency
  18. 2026-06-19 Qwen2.5-7B-Instruct 0.481% Train, Retrieve, or Both? Head-to-Head on Statutory Citation for Ontario RTA
  19. 2026-06-19 FineREX 15.5% FineREX: Fine-Tuned NER-RE for Human Smuggling Knowledge Graphs
  20. 2026-06-19 GatorTron 0.96 Zero-Shot Agentic LLMs Extract Lung Pathology from Narratives
  21. 2026-06-19 Qwen3-4B 74.27% STAGE: Source-Grounded Data Generation for Text-to-JSON
  22. 2026-06-19 Gemma-2-2B 99.6% Causal Activation Directions for Mitigating Emergent Misalignment in Language Models
  23. 2026-06-19 Qwen2.5-1.5B 99.6% Causal Activation Directions for Mitigating Emergent Misalignment in Language Models
  24. 2026-06-19 Llama-3.2-1B 99.6% Causal Activation Directions for Mitigating Emergent Misalignment in Language Models
  25. 2026-06-19 Ministral-3-3B 99.6% Causal Activation Directions for Mitigating Emergent Misalignment in Language Models
  26. 2026-06-18 GrapNet+ER 63.16% GrapNet: A Programmable Dynamic-Architecture Neural Graph Substrate
  27. 2026-06-18 MLP+ER 51.08% GrapNet: A Programmable Dynamic-Architecture Neural Graph Substrate
  28. 2026-06-18 GrapNet 3.81pts GrapNet: A Programmable Dynamic-Architecture Neural Graph Substrate
  29. 2026-06-17 LoopCoder-V2 64.4% LoopCoder-V2: Two-Loop PLT Model Achieves Best Gain-Cost Trade-Off
  30. 2026-06-17 LoopCoder-v2 64.4pts LoopCoder-v2 Achieves Optimal Two-Loop Performance
  31. 2026-06-17 GPT-3.5 Turbo 97.0% Handlebars Triple-Brace Injection Exploits Structural Role Delimiters
  32. 2026-06-17 Codex with GPT-5.5 90.0% ReproRepo: Scalable Reproducibility Audits with GitHub Issues
  33. 2026-06-17 ProvenanceGuard 0.802 ProvenanceGuard: Source-Aware Factuality Verification for MCP-Based LLM Agents
  34. 2026-06-17 Claude Sonnet 4.6 8.8% Geographic Bias in Large Language Models from User Metadata
  35. 2026-06-17 Llama 3.1-8B 31.7% Geographic Bias in Large Language Models from User Metadata
  36. 2026-06-17 Qwen3-8B 21.3% Geographic Bias in Large Language Models from User Metadata
  37. 2026-06-17 GPT-3.5 Turbo 97.0% Handlebars Triple-Brace Injection Exploits Structural Role Delimiters
  38. 2026-06-17 Codex with GPT-5.5 90.0% ReproRepo: Scaling Reproducibility Audits with GitHub Issues
  39. 2026-06-17 offline preference-based trajectory evaluation 35.0% Preference-Based Trajectory Evaluation for Agentic Systems
  40. 2026-06-17 domain-specific composite tools 90.0% T-API-Compliant ReAct Loop for Optical Networks
  41. 2026-06-17 ProvenanceGuard 0.802 ProvenanceGuard: Source-Aware Factuality Verification for MCP-Based LLM Agents
  42. 2026-06-17 EAGG 56.17% EAGG: Embodiment-Aligned Grasp Generation via Geometry-Aware Graph Conditioning
  43. 2026-06-17 Claude Opus 4.6 0.97% ALeRCE Launches Text-to-SQL System with LLMs
  44. 2026-06-17 GPT-3.5 Turbo 97.0% Handlebars Triple-Brace Injection Exploits Structural Role Delimiters
  45. 2026-06-17 GitHub Copilot 80.2% Oracle Signals in Agent-Authored Test Code
  46. 2026-06-17 Devin 80.2% Oracle Signals in Agent-Authored Test Code
  47. 2026-06-17 Cursor 80.2% Oracle Signals in Agent-Authored Test Code
  48. 2026-06-17 Claude Code 80.2% Oracle Signals in Agent-Authored Test Code
  49. 2026-06-17 OpenAI Codex 80.2% Oracle Signals in Agent-Authored Test Code
  50. 2026-06-17 Codex with GPT-5.5 90.0% ReproRepo: Scaling Reproducibility Audits with GitHub Issues