Abstract
Large language model (LLM) agents need to perform multi-turn interactions in real-world tasks. However, existing multi-turn RL algorithms for optimizing LLM agents fail to perform effective credit assignment over multiple turns while leveraging the generalization capabilities of LLMs, and it remains unclear how to develop such algorithms. To study this, we first introduce a new benchmark, ColBench, where an LLM agent interacts with a human collaborator over multiple turns to solve realistic tasks in backend programming and frontend design. Building on this benchmark, we propose a novel RL algorithm, SWEET-RL (RL with Step-WisE Evaluation from Training-time information), that uses a carefully designed optimization objective to train a critic model with access to additional training-time information. The critic provides step-level rewards for improving the policy model. Our experiments demonstrate that SWEET-RL achieves a 6% absolute improvement in success and win rates on ColBench compared to other state-of-the-art multi-turn RL algorithms, enabling Llama-3.1-8B to match or exceed the performance of GPT-4o in realistic collaborative content creation.