A study evaluates six LLMs on detecting real-world web vulnerabilities in WordPress plugins and finds that detection rates vary by model and prompt design. Claude Opus 4.6 achieved the highest detection rate at 63%, Qwen 3.5 reached only 35%, and no model consistently identified all baseline vulnerabilities across iterations.
arXiv cs.AI
research
LLMs Benchmarked for Web Vulnerability Detection
Importance 2/3
Mistral AI
Alibaba (Qwen)
xAI
Code generation
Evaluation & benchmarks
Reasoning models
Benchmarks
| Benchmark | Model | Detection rate |
|---|---|---|
| WordPress plugin vulnerability detection | Claude Opus 4.6 | 63% |
| WordPress plugin vulnerability detection | MiniMax M2.5 | 48% |
| WordPress plugin vulnerability detection | Qwen 3.5 | 35% |