arXiv cs.AI · 6d ago · research

Calibration Without Comprehension in LLM Vulnerability Detection

CWE-Trace evaluates 8 vanilla and 15 LoRA-fine-tuned LLMs on Linux kernel vulnerability detection. The results show that data contamination offers no advantage and that fine-tuning only shifts output thresholds without altering decision policies. Despite improved detection scores, the LLMs lack reliable security reasoning: top-1 CWE accuracy stays below 1.3% and binary detection performance is 52.1%.
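One way to read "shifts output thresholds without altering decision policies" is sketched below. This is a toy illustration, not the paper's method: the scores, labels, the uniform `shift`, and the helper names `auroc` and `recall_at` are all made up for the example. It shows that adding a constant to every score can raise a thresholded detection score while leaving the ranking of samples, and therefore AUROC, unchanged.

```python
# Toy illustration only: the scores and labels are invented, not from the paper.

def auroc(scores, labels):
    """Chance that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def recall_at(scores, labels, threshold):
    """Fraction of vulnerable samples (label 1) flagged at the given threshold."""
    flagged = [s >= threshold for s, y in zip(scores, labels) if y == 1]
    return sum(flagged) / len(flagged)

# 1 = vulnerable, 0 = not vulnerable (hypothetical labels).
labels = [1, 0, 1, 0, 1, 0, 1, 0]
base_scores = [0.40, 0.35, 0.30, 0.45, 0.20, 0.25, 0.45, 0.30]

# Hypothetical "fine-tuned" model: every score moves up by the same offset,
# so the ordering of samples is exactly what it was before.
shift = 0.25
tuned_scores = [s + shift for s in base_scores]

print("AUROC  base / tuned:", auroc(base_scores, labels), auroc(tuned_scores, labels))
print("Recall base / tuned:", recall_at(base_scores, labels, 0.5),
      recall_at(tuned_scores, labels, 0.5))
```

In this sketch the tuned scores flag more vulnerable samples at the fixed 0.5 cutoff, yet the model separates vulnerable from non-vulnerable code no better than before, which is the pattern the summary describes as calibration without comprehension.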

Importance: 3/3 · New harness with differentiators · arXiv cs.AI · DeepSeek · Meta AI · OpenAI · Evaluation & benchmarks · Reasoning models · Safety & alignment

Benchmarks

Benchmark	Model	Score
SWE-bench Verified	DeepSeek-R1	52.1%
