Multi-Agent Debate for Software Architecture Governance: A Framework for ADR Generation, Risk Review, and Deployment Readiness
Abstract
Architecture review at most organizations is a serial, expert-bound activity: a small number of senior engineers read a design, apply tacit heuristics, and produce decisions whose rationale is rarely captured in a structured, auditable form. Single large language models (LLMs) have been proposed to assist this work, but a single model that both proposes and reviews a design may reinforce assumptions from its own proposal, agree with itself, and present trade-offs as settled rather than contested. We propose MAD-Arch (Multi-Agent Debate for Architecture), a role-partitioned, human-in-the-loop framework for software architecture governance rather than mere critique. MAD-Arch separates design generation from design criticism: a proposal agent produces a baseline, a panel of adversarial reviewer roles—scalability, security, reliability, delivery, and cost—independently critiques it, the roles debate their disagreements over bounded rounds, and an arbiter consolidates the exchange into structured, version-controlled governance artifacts—Architecture Decision Records (ADRs), a security and compliance risk register, an operational-readiness checklist, and deployment-governance recommendations—together with a final recommendation carrying a transparent confidence score and an explicit list of unresolved assumptions. A human architect remains the accountable decision-maker at a mandatory approval gate, and the artifacts are designed to integrate with pull requests and CI/CD gates. 
We contribute (i) the generation–criticism separation as an organizing principle, (ii) a reproducible debate workflow, (iii) a rubric-based arbiter confidence model, and (iv) a conceptual evaluation protocol over three reference scenarios (a cloud-native order-processing system, an event-driven IoT telemetry platform, and an AI-powered document-processing pipeline) that compares MAD-Arch against single-model and checklist-only human review using precision/recall-style risk scoring, trade-off coverage, artifact completeness, and a redundant-recommendation rate. We report illustrative pilot-style values that demonstrate the protocol rather than prove effectiveness; we conducted no production deployment, and empirical validation remains future work. We position MAD-Arch as decision support rather than autonomous architecture decision-making.
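To make two of the evaluation measures named above concrete, the following is a minimal sketch of how precision/recall-style risk scoring and a redundant-recommendation rate could be computed. It is not the protocol's actual implementation. It assumes that risks are matched by identifier against a reference risk set for each scenario, and that a recommendation counts as redundant when it repeats an earlier one after case and whitespace normalization. The names `RiskScore`, `score_risks`, and `redundant_rate`, and the identifiers in the usage example, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RiskScore:
    """Precision and recall of identified risks against a reference set."""
    precision: float
    recall: float


def score_risks(identified: set[str], reference: set[str]) -> RiskScore:
    # Assumption: risks are compared by exact identifier match.
    true_positives = len(identified & reference)
    precision = true_positives / len(identified) if identified else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return RiskScore(precision, recall)


def redundant_rate(recommendations: list[str]) -> float:
    # Assumption: a recommendation is redundant if its normalized text
    # (lowercased, whitespace collapsed) has already been seen.
    if not recommendations:
        return 0.0
    seen: set[str] = set()
    redundant = 0
    for rec in recommendations:
        key = " ".join(rec.lower().split())
        if key in seen:
            redundant += 1
        else:
            seen.add(key)
    return redundant / len(recommendations)


if __name__ == "__main__":
    # Hypothetical identifiers, for illustration only.
    score = score_risks({"R1", "R2", "R3", "R7"}, {"R1", "R2", "R4", "R5"})
    print(score.precision, score.recall)
    print(redundant_rate(["Add a retry queue", "add a  retry queue", "Encrypt data at rest"]))
```

Exact-match scoring is the simplest choice. A real evaluation would likely need a human or semantic matching step to decide when two differently worded risks or recommendations are the same.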