Model-Based Offline Reinforcement Learning with Adversarial Data Augmentation

Abstract

Model-based offline Reinforcement Learning (RL) constructs environment models from offline datasets to perform conservative policy optimization. Existing approaches focus on learning state transitions through ensemble models and rolling out with conservative estimation to mitigate extrapolation errors. However, the static data makes it challenging to develop a robust policy, and offline agents cannot access the environment to gather new data. To address these challenges, we introduce Model-based Offline Reinforcement learning with AdversariaL data augmentation (MORAL). In MORAL, we replace the fixed-horizon rollout with adversarial data augmentation, which performs alternating sampling with ensemble models to enrich the training data. Specifically, this adversarial process dynamically selects ensemble models against the policy for biased sampling, mitigating the optimistic estimation of fixed models and thus robustly expanding the training data for policy optimization. Moreover, a differential factor is integrated into the adversarial process for regularization, ensuring error minimization in extrapolations. This data-augmented optimization adapts to diverse offline tasks without rollout horizon tuning, showing broad applicability. Extensive experiments on the D4RL benchmark demonstrate that MORAL outperforms other model-based offline RL methods in terms of policy learning and sample efficiency.
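
The alternating sampling described above can be illustrated with a minimal sketch. It is not the MORAL implementation: the selection criterion (the adversary picks the ensemble member with the lowest one-step value estimate r + gamma * V(s')), the function names `adversarial_step` and `augment`, and the toy 1-D models are all assumptions made for illustration, and the differential factor used for regularization is omitted because the abstract does not specify its form.

```python
def adversarial_step(state, policy, ensemble, value_fn, gamma=0.99):
    """One alternating step: the policy acts, then an adversary picks a model.

    Assumed criterion (hypothetical, not from the paper): the adversary
    chooses the ensemble member whose prediction minimizes r + gamma * V(s').
    """
    action = policy(state)
    # Each model maps (state, action) to a predicted (next_state, reward).
    candidates = [model(state, action) for model in ensemble]
    next_state, reward = min(
        candidates, key=lambda c: c[1] + gamma * value_fn(c[0])
    )
    return state, action, reward, next_state


def augment(start_states, policy, ensemble, value_fn, gamma=0.99):
    """Generate one adversarially selected transition per starting state."""
    return [
        adversarial_step(s, policy, ensemble, value_fn, gamma)
        for s in start_states
    ]


# Toy 1-D setup with three hand-written "dynamics models" (illustrative only).
ensemble = [
    lambda s, a: (s + a + 0.1, 1.0),
    lambda s, a: (s + a - 0.1, 0.5),
    lambda s, a: (s + a, 0.8),
]
policy = lambda s: 1.0   # constant action
value_fn = lambda s: s   # stand-in value estimate

batch = augment([0.0, 2.0], policy, ensemble, value_fn)
```

In this toy setup the second model yields the lowest estimate for both starting states, so the adversary selects it each time. In practice the policy and value function would be updated on the augmented batch between sampling rounds, so the selected model can change as the policy changes.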