The authors characterize inference-time pattern-memory gating in a production-scale clinical NLP pipeline that pairs a Llama-3.3 70B generator with an MMed-Llama-3.1 70B verifier across 167,034 PMC-Patients narratives.
- Learning filtering rules directly from the verifier's rejections failed because the rejections were spread too thinly across distinct forms.
- A simpler rule using a fixed clinical ontology captured 49,734 ontology-violating relations on a held-out set without the verifier.
- Four of five question-answering filters failed; the fifth succeeded by checking whether entities support the question, flagging verifier-rejected answers 1.84 times more often.
- A filter is selective only when it tests the same evidence the verifier weighs, not when it imitates the verifier's output.
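
The fixed-ontology rule in the list above can be illustrated with a minimal sketch. The ontology contents, the type names, and the identifiers `Relation`, `ALLOWED`, `violates_ontology`, and `gate` are all hypothetical and not taken from the study; the sketch assumes only that a relation violates the ontology when its head and tail types are not permitted for its predicate.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Relation(NamedTuple):
    head_type: str
    predicate: str
    tail_type: str


# Hypothetical ontology (an assumption for illustration):
# for each predicate, the set of permitted (head_type, tail_type) pairs.
ALLOWED = {
    "treats": {("Drug", "Disease"), ("Procedure", "Disease")},
    "causes": {("Disease", "Symptom"), ("Drug", "Symptom")},
}


def violates_ontology(rel: Relation) -> bool:
    """Return True when the relation's types are not permitted for its predicate."""
    allowed = ALLOWED.get(rel.predicate)
    if allowed is None:
        # Assumption: a predicate missing from the ontology counts as a violation.
        return True
    return (rel.head_type, rel.tail_type) not in allowed


def gate(relations: Iterable[Relation]) -> Tuple[List[Relation], List[Relation]]:
    """Split relations into (kept, dropped) without consulting any verifier."""
    kept, dropped = [], []
    for rel in relations:
        (dropped if violates_ontology(rel) else kept).append(rel)
    return kept, dropped
```

A rule of this shape needs no training data from the verifier, which is why it can be evaluated on a held-out set on its own.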
The study demonstrates that natural memory designs can fail silently at scale and that pre-generation gate selectivity depends on probing the question the verifier answers.
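
One way to make gate selectivity concrete is as a ratio of flag rates. The sketch below assumes, without confirmation from the study, that selectivity means how much more often a filter flags verifier-rejected answers than verifier-accepted ones; the function name `selectivity` and the toy data are hypothetical.

```python
from typing import Sequence


def selectivity(flagged: Sequence[bool], rejected: Sequence[bool]) -> float:
    """Ratio of the filter's flag rate on rejected answers to its flag rate
    on accepted answers (an assumed definition, for illustration only)."""
    if len(flagged) != len(rejected):
        raise ValueError("flagged and rejected must have the same length")

    on_rejected = [f for f, r in zip(flagged, rejected) if r]
    on_accepted = [f for f, r in zip(flagged, rejected) if not r]
    if not on_rejected or not on_accepted:
        raise ValueError("need at least one rejected and one accepted answer")

    rate_rejected = sum(on_rejected) / len(on_rejected)
    rate_accepted = sum(on_accepted) / len(on_accepted)
    if rate_accepted == 0:
        # The filter never flags accepted answers; the ratio is unbounded
        # unless it also never flags rejected ones.
        return float("inf") if rate_rejected > 0 else float("nan")
    return rate_rejected / rate_accepted


# Toy data: the filter flags 3 of 4 rejected answers and 1 of 4 accepted ones.
flags = [True, True, False, True, False, False, False, True]
verdicts = [True, True, True, True, False, False, False, False]
ratio = selectivity(flags, verdicts)
```

Under this definition a ratio near 1 means the filter is not selective, since it flags rejected and accepted answers at about the same rate.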