Near-Optimal Sample Complexity for Iterated CVaR Reinforcement Learning with a Generative Model

Abstract

In this work, we study the sample complexity problem of risk-sensitiveReinforcement Learning (RL) with a generative model, where we aim to maximizethe Conditional Value at Risk (CVaR) with risk tolerance level $\tau$ at eachstep, a criterion we refer to as Iterated CVaR. We first build a connectionbetween Iterated CVaR RL and $(s, a)$-rectangular distributional robust RL witha specific uncertainty set for CVaR. We establish nearly matching upper andlower bounds on the sample complexity of this problem. Specifically, we firstprove that a value iteration-based algorithm, ICVaR-VI, achieves an$\epsilon$-optimal policy with at most $\tilde{O}\left(\frac{SA}{(1-\gamma)^4\tau^2\epsilon^2} \right)$ samples, where $\gamma$is the discount factor, and $S, A$ are the sizes of the state and actionspaces. Furthermore, when $\tau \geq \gamma$, the sample complexity improves to$\tilde{O} \left( \frac{SA}{(1-\gamma)^3\epsilon^2} \right)$. We further show aminimax lower bound of $\tilde{O} \left(\frac{(1-\gamma\tau)SA}{(1-\gamma)^4\tau\epsilon^2} \right)$. For a fixed risk level $\tau \in(0,1]$, our upper and lower bounds match, demonstrating the tightness andoptimality of our analysis. We also investigate a limiting case with a smallrisk level $\tau$, called Worst-Path RL, where the objective is to maximize theminimum possible cumulative reward. We develop matching upper and lower boundsof $\tilde{O} \left(\frac{SA}{p_{\min}} \right)$, where $p_{\min}$ denotes theminimum non-zero reaching probability of the transition kernel.