Red Teaming Framework Uncovers LLM Faithfulness Vulnerabilities via Multi-Role Architecture

This paper introduces a red teaming framework designed to systematically uncover vulnerabilities in large language model outputs through a multi-role architecture. The system utilizes target, attacker, and jury models to generate adversarial prompts and rigorously evaluate response accuracy and consistency. In a case study on faithfulness evaluation, exploitative adversarial prompts increased the attack success rate by up to 7.9% in question-answering tasks. The research demonstrates that architectural design choices typically outweigh parameter scaling in determining model safety and identifies how structural constraints shape vulnerability patterns. The framework shows adaptability across diverse evaluation tasks, ranging from English question-answering to Arabic summarization. However, the approach faces challenges in fully automating adversarial prompt generation across different languages. Additionally, experiments reveal limitations in detecting subtle forms of unfaithfulness that do not manifest as explicit factual contradictions.