Litmus: Zero-Label, Code-Driven Metric Specification for Evaluating AI Systems

Litmus is a zero-label system that designs evaluation and monitoring metrics for AI pipelines by eliciting evaluation intent from source code and targeted interrogation. Rather than assuming the evaluation target is known, it first identifies what must be measured and why, then constructs a justified metric portfolio.

Evaluated on three real, code-defined AI pipelines: financial account grouping, scientific QA, and inherent risk assessment.
Achieved the broadest or tied-broadest concern coverage and spanned more pipeline stages than AutoMetrics and three DynamicRubric baselines.
Produced a near-zero-redundancy portfolio and ranked first in validity against per-row quality labels on all three pipelines.
Decisively outperformed baselines on scientific QA with a Spearman correlation of 0.72 versus less than 0.47 for every baseline.
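
To make the validity figures concrete, here is a minimal sketch of how a Spearman correlation between a metric's per-row scores and per-row quality labels could be computed. It is not the evaluation code behind the reported numbers: the function names (`average_ranks`, `pearson`, `spearman`) and the sample data are hypothetical, and the handling of ties (average ranks) is an assumption.

```python
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(scores: Sequence[float], labels: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the average ranks."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    return pearson(average_ranks(scores), average_ranks(labels))


# Illustrative data only: metric scores and binary quality labels for four rows.
metric_scores = [0.1, 0.4, 0.4, 0.9]
quality_labels = [0, 1, 1, 1]
print(round(spearman(metric_scores, quality_labels), 3))
```

Because the statistic depends only on rank order, a metric can score well here without being calibrated to the label scale, which suits comparing metrics that produce scores on different ranges.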

The results support a shift from automatic metric implementation to automatic metric specification, arguing that evaluation systems should first determine what must be measured and why.