CWE-Trace evaluates 8 vanilla and 15 LoRA-fine-tuned LLMs on Linux kernel vulnerability detection. The results show that data contamination offers no advantage and that fine-tuning only shifts output thresholds without altering decision policies. Despite improved detection scores, the LLMs lack reliable security reasoning: top-1 CWE accuracy is below 1.3% and binary detection performance is 52.1%.
arXiv cs.AI · 6d ago · research
Calibration Without Comprehension in LLM Vulnerability Detection
Importance 3/3
New harness with differentiators
DeepSeek
Meta AI
OpenAI
Evaluation & benchmarks
Reasoning models
Safety & alignment
Benchmarks
| Benchmark | Model | Score |
|---|---|---|
| CWE-Trace (binary detection) | DeepSeek-R1 | 52.1% |
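
The summary's distinction between shifting an output threshold and altering a decision policy can be illustrated with a small synthetic sketch. It does not reproduce CWE-Trace or the paper's method: the detector, its logits, the shift of 1.0, and the helper names `auc` and `flag_rate` are all hypothetical. The sketch assumes that a pure threshold shift behaves like adding a constant to every logit, which changes how often the model flags code but leaves the ranking of samples untouched.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def flag_rate(scores, threshold=0.5):
    """Share of samples the detector labels as vulnerable."""
    return sum(s >= threshold for s in scores) / len(scores)


random.seed(0)
labels = [i % 2 for i in range(200)]  # balanced synthetic labels

# Hypothetical weak detector: logits depend only slightly on the true label.
logits = [random.gauss(0.1 * y - 1.0, 1.0) for y in labels]

base = [sigmoid(z) for z in logits]
# Assumed model of a "threshold shift": the same constant added to every logit.
shifted = [sigmoid(z + 1.0) for z in logits]

print("flag rate:", flag_rate(base), "->", flag_rate(shifted))
print("AUC:      ", auc(base, labels), "->", auc(shifted, labels))
```

In this toy setting the flag rate rises after the shift while the AUC stays the same, because a constant offset preserves the order of the scores. A detection score measured at a fixed threshold can therefore move without the detector separating vulnerable from safe code any better.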