Abstract
Persona agents, which are LLM agents that act according to an assigned persona, have demonstrated impressive contextual response capabilities across various applications. These persona agents offer significant enhancements across diverse sectors, such as education, healthcare, and entertainment, where model developers can align agent responses to different user requirements, thereby broadening the scope of agent applications. However, evaluating persona agent performance is incredibly challenging due to the complexity of assessing persona adherence in free-form interactions across various environments that are relevant to each persona agent. We introduce PersonaGym, the first dynamic evaluation framework for assessing persona agents, and PersonaScore, the first automated human-aligned metric grounded in decision theory for comprehensive large-scale evaluation of persona agents. Our evaluation of 6 open and closed-source LLMs, using a benchmark encompassing 200 personas and 10,000 questions, reveals significant opportunities for advancement in persona agent capabilities across state-of-the-art models. For example, Claude 3.5 Sonnet achieves only a 2.97% relative improvement in PersonaScore over GPT 3.5 despite being a much more advanced model. Importantly, we find that increased model size and complexity do not necessarily imply enhanced persona agent capabilities, thereby highlighting the pressing need for algorithmic and architectural invention towards faithful and performant persona agents.