NuclearQAv2: A Structured Benchmark for Evaluating Domain-Science Competence in Large Language Models

Researchers introduce NuclearQAv2, a benchmark that assesses the reliability of large language models in nuclear engineering by testing factual knowledge, quantitative reasoning, and conceptual understanding.

The benchmark consists of approximately 1,240 question-answer pairs categorized as boolean, numeric, or verbal.
It is constructed via a hybrid pipeline combining expert-authored questions, existing datasets, and LLM-assisted generation from technical corpora.
The evaluation shows that models handle factual questions well, whereas quantitative reasoning and conceptual understanding remain significantly more challenging.
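
To make the three answer categories concrete, the following is a minimal sketch of how boolean, numeric, and verbal items might be represented and scored per category. It is an illustration under stated assumptions, not the benchmark's actual schema or metric: the `QAItem` fields, the 5% relative tolerance for numeric answers, and the token-containment check for verbal answers are all hypothetical choices made for this example.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class QAItem:
    """Hypothetical record layout; the real benchmark schema may differ."""
    question: str
    answer: str
    kind: str  # one of "boolean", "numeric", "verbal"


def score(item: QAItem, prediction: str, rel_tol: float = 0.05) -> bool:
    """Return True if the prediction counts as correct for this item.

    The per-type rules are assumptions for illustration only.
    """
    pred = prediction.strip().lower()
    gold = item.answer.strip().lower()
    if item.kind == "boolean":
        # Exact match on the normalized string ("true" / "false").
        return pred == gold
    if item.kind == "numeric":
        # Accept predictions within a relative tolerance of the reference.
        try:
            return math.isclose(float(pred), float(gold), rel_tol=rel_tol)
        except ValueError:
            return False  # prediction could not be parsed as a number
    if item.kind == "verbal":
        # Crude placeholder: every reference token must appear in the prediction.
        gold_tokens = set(gold.split())
        return bool(gold_tokens) and gold_tokens <= set(pred.split())
    raise ValueError(f"unknown answer kind: {item.kind!r}")


def accuracy_by_kind(items: list[QAItem], predictions: list[str]) -> dict[str, float]:
    """Aggregate accuracy separately for each answer category."""
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for item, pred in zip(items, predictions):
        total[item.kind] += 1
        correct[item.kind] += int(score(item, pred))
    return {kind: correct[kind] / total[kind] for kind in total}


if __name__ == "__main__":
    # Made-up items for demonstration; they are not taken from the benchmark.
    items = [
        QAItem("Is uranium-235 fissile?", "true", "boolean"),
        QAItem("How many protons does a uranium nucleus contain?", "92", "numeric"),
        QAItem("Which particle induces fission in a thermal reactor?", "neutron", "verbal"),
    ]
    predictions = ["True", "92", "a thermal neutron"]
    print(accuracy_by_kind(items, predictions))
```

Reporting accuracy per category rather than as a single pooled number keeps a gap between factual and quantitative performance visible. A real evaluation of verbal answers would likely need a stronger judgment than token containment, such as expert or model-based grading.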

This work establishes NuclearQAv2 as a scalable framework for evaluating LLM capabilities in technical domains, highlighting the need for multi-faceted assessment beyond simple factual recall.