Offline Reinforcement Learning via Linear-Programming with Error-Bound Induced Constraints

Abstract

Offline reinforcement learning (RL) aims to find an optimal policy for Markovdecision processes (MDPs) using a pre-collected dataset. In this work, werevisit the linear programming (LP) reformulation of Markov decision processesfor offline RL, with the goal of developing algorithms with optimal$O(1/\sqrt{n})$ sample complexity, where $n$ is the sample size, under partialdata coverage and general function approximation, and with favorablecomputational tractability. To this end, we derive new \emph{error bounds} forboth the dual and primal-dual formulations of the LP, and incorporate themproperly as \emph{constraints} in the LP reformulation. We then show that undera completeness-type assumption, $O(1/\sqrt{n})$ sample complexity can beachieved under standard single-policy coverage assumption, when one properly\emph{relaxes} the occupancy validity constraint in the LP. This framework canreadily handle both infinite-horizon discounted and average-reward MDPs, inboth general function approximation and tabular cases. The instantiation to thetabular case achieves either state-of-the-art or the first sample complexitiesof offline RL in these settings. To further remove any completeness-typeassumption, we then introduce a proper \emph{lower-bound constraint} in the LP,and a variant of the standard single-policy coverage assumption. Such analgorithm leads to a $O(1/\sqrt{n})$ sample complexity with dependence on the\emph{value-function gap}, with only realizability assumptions. Our properlyconstrained LP framework advances the existing results in several aspects, inrelaxing certain assumptions and achieving the optimal $O(1/\sqrt{n})$ samplecomplexity, with simple analyses. We hope our results bring new insights intothe use of LP formulations and the equivalent primal-dual minimax optimizationfor offline RL, through the error-bound induced constraints.