Abstract
In safe reinforcement learning (RL), auxiliary safety costs are used to align the agent to safe decision making. In practice, safety constraints, including cost functions and budgets, are unknown or hard to specify, as specifying them requires anticipating all possible unsafe behaviors. We therefore address a general setting where the true safety definition is unknown and has to be learned from sparsely labeled data. Our key contributions are as follows. First, we design a safety model that performs credit assignment to estimate each decision step's impact on overall safety, using a dataset of diverse trajectories and their corresponding binary safety labels (i.e., whether each trajectory is safe or unsafe). Second, we illustrate the architecture of our safety model to demonstrate its ability to learn a separate safety score for each timestep. Third, we reformulate the safe RL problem using the proposed safety model and derive an effective algorithm to optimize a safe yet rewarding policy. Finally, our empirical results corroborate our findings and show that this approach is effective in satisfying an unknown safety definition and scales to various continuous control tasks.