arXiv cs.AI · 1d ago · src: 6d ago · research

LLMs Benchmarked for Web Vulnerability Detection

A study evaluates six LLMs on detecting real-world web vulnerabilities in WordPress plugins and finds that detection rates vary by model and prompt design. Claude Opus 4.6 achieved the highest detection rate at 63%, while Qwen 3.5 reached only 35%. No model consistently identified all baseline vulnerabilities across iterations.

Importance 2/3 arXiv cs.AI Mistral AI Alibaba (Qwen) xAI Code generation Evaluation & benchmarks Reasoning models

Benchmarks

Benchmark	Model	Detection rate
WordPress plugin vulnerability detection	Claude Opus 4.6	63%
WordPress plugin vulnerability detection	MiniMax M2.5	48%
WordPress plugin vulnerability detection	Qwen 3.5	35%
