Researchers introduce NuclearQAv2, a new benchmark designed to assess the reliability of large language models in nuclear engineering by testing factual knowledge, quantitative reasoning, and conceptual understanding.
- The benchmark consists of approximately 1,240 question-answer pairs categorized as boolean, numeric, or verbal.
- It is constructed via a hybrid pipeline combining expert-authored questions, existing datasets, and LLM-assisted generation from technical corpora.
- Evaluation reveals that while models handle factual questions well, quantitative reasoning and conceptual understanding remain significantly more challenging.
This work establishes NuclearQAv2 as a scalable framework for evaluating LLM capabilities in technical domains, highlighting the need for multi-faceted assessment beyond simple factual recall.
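
Because the benchmark mixes boolean, numeric, and verbal answers, a per-type score is one natural way to surface the gap between factual recall and quantitative or conceptual performance. The following is a minimal sketch of such a scorer, not the authors' evaluation code: the `QAItem` structure, the `is_correct` and `accuracy_by_type` helpers, the 1% relative tolerance for numeric answers, and exact string matching for verbal answers are all assumptions made for illustration. The summary does not say how NuclearQAv2 actually grades responses.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class QAItem:
    """Hypothetical record layout; the real benchmark schema is not given."""
    question: str
    answer_type: str  # "boolean", "numeric", or "verbal"
    reference: str


TRUTHY = {"true", "yes"}
FALSY = {"false", "no"}


def is_correct(item: QAItem, prediction: str, rel_tol: float = 0.01) -> bool:
    """Grade one prediction with a rule chosen by answer type (assumed rules)."""
    pred = prediction.strip().lower()
    ref = item.reference.strip().lower()

    if item.answer_type == "boolean":
        # Unrecognized answers count as wrong rather than as "false".
        if pred not in TRUTHY | FALSY:
            return False
        return (pred in TRUTHY) == (ref in TRUTHY)

    if item.answer_type == "numeric":
        # Assumption: accept values within a relative tolerance of the reference.
        try:
            return math.isclose(float(pred), float(ref), rel_tol=rel_tol)
        except ValueError:
            return False

    if item.answer_type == "verbal":
        # Placeholder: exact match after normalization. Free-text answers
        # would realistically need expert or model-based grading.
        return pred == ref

    raise ValueError(f"unknown answer type: {item.answer_type}")


def accuracy_by_type(items: list[QAItem], predictions: list[str]) -> dict[str, float]:
    """Return the fraction of correct predictions for each answer type."""
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for item, prediction in zip(items, predictions):
        total[item.answer_type] += 1
        correct[item.answer_type] += int(is_correct(item, prediction))
    return {kind: correct[kind] / total[kind] for kind in total}


if __name__ == "__main__":
    # Illustrative items written for this sketch, not taken from the benchmark.
    sample = [
        QAItem("Is uranium-235 fissile?", "boolean", "true"),
        QAItem("What is the atomic number of uranium?", "numeric", "92"),
        QAItem("Which particles does a moderator slow down?", "verbal", "neutrons"),
    ]
    print(accuracy_by_type(sample, ["yes", "92", "neutrons"]))
```

Reporting accuracy separately for each answer type, rather than as a single aggregate, keeps a strong score on boolean questions from masking weaker numeric or verbal performance.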