Pre-Flight benchmark reveals LLMs lag expert reliability in aviation operational knowledge

The Pre-Flight benchmark evaluates large language models on aviation operational knowledge, revealing a significant gap between model performance and expert human capability. Comprising 300 multiple-choice questions authored by aviation practitioners, the benchmark tests understanding of international standards, ICAO and US FAA regulations, and ground operations.

The dataset covers international airport ground operations, regulatory frameworks, and complex operational scenarios.
Evaluation was conducted using the Inspect evaluation framework with a standard multiple-choice protocol.
The strongest model evaluated in 2026 reached 82.7% accuracy, against an expert reference of approximately 95%.
Performance has improved gradually from roughly 75% in early 2025, but it remains about 12 percentage points below the expert reference.
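
To make the size of that gap concrete, the following is a minimal sketch of the arithmetic, not the authors' analysis. It assumes all 300 questions were scored, so 82.7% implies 248 correct answers, and it treats the approximate 95% expert reference as exact. The Wilson score interval is an addition for illustration and is not described in the source; the function names are hypothetical.

```python
import math

N_QUESTIONS = 300         # benchmark size stated in the text
BEST_ACCURACY = 0.827     # strongest model, as reported
EXPERT_REFERENCE = 0.95   # approximate expert reference, treated as exact here


def implied_correct(accuracy: float, n: int) -> int:
    """Nearest whole number of correct answers for a reported accuracy."""
    return round(accuracy * n)


def wilson_interval(correct: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = correct / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def gap_in_points(model_acc: float, expert_acc: float) -> float:
    """Gap between expert and model accuracy, in percentage points."""
    return (expert_acc - model_acc) * 100


if __name__ == "__main__":
    model_correct = implied_correct(BEST_ACCURACY, N_QUESTIONS)
    expert_correct = implied_correct(EXPERT_REFERENCE, N_QUESTIONS)
    low, high = wilson_interval(model_correct, N_QUESTIONS)
    print(f"model correct answers (implied): {model_correct}")
    print(f"additional correct answers needed: {expert_correct - model_correct}")
    print(f"gap: {gap_in_points(BEST_ACCURACY, EXPERT_REFERENCE):.1f} points")
    print(f"approx. 95% interval for model accuracy: {low:.3f} to {high:.3f}")
```

Under these assumptions, the best model would need several dozen more correct answers out of 300 to match the expert reference, and the upper end of its interval still falls short of 95%.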

The authors argue that domain-specific evaluation is a necessary precondition for the responsible deployment of generative AI in safety-critical aviation operations.