The Pre-Flight benchmark evaluates large language models on aviation operational knowledge, revealing a significant gap between model performance and expert human capability. Comprising 300 multiple-choice questions authored by aviation practitioners, the benchmark tests understanding of international standards, ICAO and US FAA regulations, and ground operations.

  • The dataset covers international airport ground operations, regulatory frameworks, and complex operational scenarios.
  • Evaluation was conducted using the Inspect evaluation framework with a standard multiple-choice protocol.
  • Even the strongest model evaluated in 2026 achieved only 82.7% accuracy, compared to an expert reference of approximately 95%.
  • Model performance has improved gradually, from roughly 75% in early 2025, but remains substantially below expert-level reliability.
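
To make the size of that gap concrete, the arithmetic behind the headline figures can be sketched as follows. This is a minimal illustration built only from the numbers quoted above (300 questions, 82.7%, roughly 95%). The count of 248 correct answers is inferred from the rounded accuracy, assuming one scored answer per question. The Wilson score interval is an added illustrative choice, not something the authors are stated to have reported.

```python
import math

N_QUESTIONS = 300        # benchmark size, from the summary
BEST_ACCURACY = 0.827    # strongest model, from the summary
EXPERT_ACCURACY = 0.95   # approximate expert reference, from the summary


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly a 95% level)."""
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


# Assumption: one scored answer per question, so 82.7% is taken as 248 of 300.
correct = round(BEST_ACCURACY * N_QUESTIONS)
low, high = wilson_interval(correct, N_QUESTIONS)

# Gap in percentage points, and how many times more errors the model makes.
gap_points = (EXPERT_ACCURACY - BEST_ACCURACY) * 100
error_ratio = (1 - BEST_ACCURACY) / (1 - EXPERT_ACCURACY)

if __name__ == "__main__":
    print(f"correct answers (inferred): {correct}/{N_QUESTIONS}")
    print(f"Wilson interval: {low:.3f} to {high:.3f}")
    print(f"gap to expert reference: {gap_points:.1f} percentage points")
    print(f"error-rate ratio (model / expert): {error_ratio:.2f}")
```

Under these assumptions the gap is 12.3 percentage points, and the upper end of the interval still falls well short of the expert reference, so sampling noise on a 300-question set alone would not account for the difference. Viewed as error rates, the strongest model misses about three and a half times as many questions as the expert reference.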

The authors argue that domain-specific evaluation is a necessary precondition for the responsible deployment of generative AI in non-safety-critical aviation operations.