Measurement Gap in EU Law Automation

Large language models can generate median-quality legal text, but no benchmark evaluates their ability to perform doctrinal legal reasoning. This gap undermines the EU AI Act's requirement of 'appropriate accuracy' in judicial AI: without a standard for evaluating doctrinal reasoning, that requirement cannot be given an operational definition.