Evaluation-Driven Development of LLM Agents: A Process Model and Reference Architecture

Abstract

Large Language Models (LLMs) have enabled the emergence of LLM agents: autonomous systems capable of achieving under-specified goals and adapting post-deployment, often without explicit code or model changes. Evaluating these agents is critical to ensuring their performance and safety, especially given their dynamic, probabilistic, and evolving nature. However, traditional approaches such as predefined test cases and standard redevelopment pipelines struggle to address the unique challenges of LLM agent evaluation. These challenges include capturing open-ended behaviors, handling emergent outcomes, and enabling continuous adaptation over the agent's lifecycle. To address these issues, we propose an evaluation-driven development approach, inspired by test-driven and behavior-driven development but reimagined for the unique characteristics of LLM agents. Through a multivocal literature review (MLR), we synthesize the limitations of existing LLM evaluation methods and introduce a novel process model and reference architecture tailored for evaluation-driven development of LLM agents. Our approach integrates online (runtime) and offline (redevelopment) evaluations, enabling adaptive runtime adjustments and systematic iterative refinement of pipelines, artifacts, system architecture, and LLMs themselves. By continuously incorporating evaluation results, including fine-grained feedback from human and AI evaluators, into each stage of development and operation, this framework ensures that LLM agents remain aligned with evolving goals, user needs, and governance standards.