This article introduces a signal-coverage matrix that stratifies type and semantic errors in LLM autoformalization, moving beyond scalar type-correctness metrics. The framework crosses Lean elaborator results with semantic-equivalence judgments and assigns each output to one of four cells: true success, type-only, semantic-only, or both-fail.
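The four-cell assignment can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names and cell labels are hypothetical, and it assumes that "type-only" means the output fails elaboration but is judged semantically equivalent, and that "semantic-only" means it elaborates but is judged not equivalent.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Hypothetical cell labels; the naming convention is an assumption.
CELLS = ("true_success", "type_only", "semantic_only", "both_fail")


def coverage_cell(elaborates: bool, semantically_equivalent: bool) -> str:
    """Map one output's two binary signals to a matrix cell.

    Assumed reading: "type_only" = only the type (elaborator) check fails,
    "semantic_only" = only the semantic-equivalence check fails.
    """
    if elaborates and semantically_equivalent:
        return "true_success"
    if not elaborates and semantically_equivalent:
        return "type_only"
    if elaborates and not semantically_equivalent:
        return "semantic_only"
    return "both_fail"


def coverage_matrix(results: Iterable[Tuple[bool, bool]]) -> Dict[str, int]:
    """Count outputs per cell from (elaborates, semantically_equivalent) pairs."""
    counts = Counter(coverage_cell(e, s) for e, s in results)
    return {cell: counts.get(cell, 0) for cell in CELLS}
```

Calling `coverage_matrix` on a list of per-problem signal pairs yields the raw counts per cell; a scalar type-correctness score corresponds to only the sum of the `true_success` and `semantic_only` counts, which is why it cannot distinguish the cells.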

  • Analysis of ProofNet# and MiniF2F-test with DeepSeek V4-Pro shows that elab-feedback methods recover approximately 64% of type-stratum errors, increasing true success rates by 34 to 36 percentage points.
  • The recovery rate from the type-only cell to true success predicts the change in true success on held-out methods to within 2/186, and makes the change in type-correctness linear in the vanilla elaborator failure rate, with an R-squared of 0.96.
  • Symbolic judges disagree with semantic judges by 26 to 37 percentage points on elab-feedback outputs, with false negatives often traceable to elaborator-forced rewrites.
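One way to read the recovery-rate findings is as simple arithmetic: if a method moves a fixed fraction of type-stratum errors into true success, the gain in true success is that fraction times the size of the stratum. The sketch below illustrates this reading only. The function names are hypothetical, the multiplicative model is an assumption rather than the authors' stated formula, and the stratum share used in the usage note is an invented input, not a reported figure.

```python
def recovery_rate(stratum_before: int, recovered: int) -> float:
    """Fraction of type-stratum errors that a method moves to true success."""
    if stratum_before <= 0:
        raise ValueError("stratum_before must be positive")
    if not 0 <= recovered <= stratum_before:
        raise ValueError("recovered must lie between 0 and stratum_before")
    return recovered / stratum_before


def predicted_delta_true_success(rate: float, stratum_share: float) -> float:
    """Predicted gain in true-success rate (as a fraction of all problems).

    Assumed model: gain = recovery rate * share of problems in the stratum.
    Under this model the gain is linear in the stratum share, with slope `rate`.
    """
    if not (0.0 <= rate <= 1.0 and 0.0 <= stratum_share <= 1.0):
        raise ValueError("rate and stratum_share must be in [0, 1]")
    return rate * stratum_share
```

For example, with the reported recovery rate of about 0.64 and a purely illustrative stratum share of 0.5, `predicted_delta_true_success(0.64, 0.5)` gives a predicted gain of 0.32, or 32 percentage points. Because the gain scales with the stratum share, a method evaluated on a benchmark with a higher vanilla elaborator failure rate would show a proportionally larger gain under this model.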

The authors argue that gains in type-correctness should be attributed to movements between specific error cells rather than reported as scalar improvements alone, because the cell-level view reveals which methods actually resolve which kinds of error.