AgentMixer: Multi-Agent Correlated Policy Factorization

Abstract

In multi-agent reinforcement learning, centralized training with decentralized execution (CTDE) methods typically assume that agents make decisions based on their local observations independently, which may not lead to a correlated joint policy with coordination. Coordination can be explicitly encouraged during training and individual policies can be trained to imitate the correlated joint policy. However, this may lead to an \textit{asymmetric learning failure} due to the observation mismatch between the joint and individual policies. Inspired by the concept of correlated equilibrium, we introduce a \textit{strategy modification} called AgentMixer that allows agents to correlate their policies. AgentMixer combines individual partially observable policies into a joint fully observable policy non-linearly. To enable decentralized execution, we introduce \textit{Individual-Global-Consistency} to guarantee mode consistency during joint training of the centralized and decentralized policies and prove that AgentMixer converges to an $\epsilon$-approximate Correlated Equilibrium. In the Multi-Agent MuJoCo, SMAC-v2, Matrix Game, and Predator-Prey benchmarks, AgentMixer outperforms or matches state-of-the-art methods.