Abstract
Self-interested individuals often fail to cooperate, posing a fundamental challenge for multi-agent learning. How can we achieve cooperation among self-interested, independent learning agents? Promising recent work has shown that in certain tasks cooperation can be established between learning-aware agents who model the learning dynamics of each other. Here, we present the first unbiased, higher-derivative-free policy gradient algorithm for learning-aware reinforcement learning, which takes into account that other agents are themselves learning through trial and error based on multiple noisy trials. We then leverage efficient sequence models to condition behavior on long observation histories that contain traces of the learning dynamics of other agents. Training long-context policies with our algorithm leads to cooperative behavior and high returns on standard social dilemmas, including a challenging environment where temporally-extended action coordination is required. Finally, we derive from the iterated prisoner's dilemma a novel explanation for how and when cooperation arises among self-interested learning-aware agents.