ViVa: Video-Trained Value Functions for Guiding Online RL from Diverse Data

Abstract

Online reinforcement learning (RL) with sparse rewards poses a challenge partly because of the lack of feedback on states leading to the goal. Furthermore, expert offline data with reward signal is rarely available to provide this feedback and bootstrap online learning. How can we guide online agents to the right solution without this on-task data? Reward shaping offers a solution by providing fine-grained signal to nudge the policy towards the optimal solution. However, reward shaping often requires domain knowledge to hand-engineer heuristics for a specific goal. To enable more general and inexpensive guidance, we propose and analyze a data-driven methodology that automatically guides RL by learning from widely available video data such as Internet recordings, off-task demonstrations, task failures, and undirected environment interaction. By learning a model of optimal goal-conditioned value from diverse passive data, we open the floor to scaling up and using various data sources to model general goal-reaching behaviors relevant to guiding online RL. Specifically, we use intent-conditioned value functions to learn from diverse videos and incorporate these goal-conditioned values into the reward. Our experiments show that video-trained value functions work well with a variety of data sources, exhibit positive transfer from human video pre-training, can generalize to unseen goals, and scale with dataset size.
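
To illustrate what incorporating a goal-conditioned value into the reward can look like, the following is a minimal sketch that assumes a simple additive form. The names `shaped_reward`, `toy_value`, and `weight` are hypothetical and not taken from the method itself. `toy_value` is a hand-written stand-in for a value function that would, in practice, be learned from video.

```python
from typing import Callable, Sequence

State = Sequence[float]
ValueFn = Callable[[State, State], float]


def shaped_reward(
    sparse_reward: float,
    next_state: State,
    goal: State,
    value_fn: ValueFn,
    weight: float = 1.0,
) -> float:
    """Add a weighted goal-conditioned value estimate to the sparse task reward.

    Assumption: an additive combination with a scalar weight. The exact
    way the value enters the reward is not specified here.
    """
    return sparse_reward + weight * value_fn(next_state, goal)


def toy_value(state: State, goal: State) -> float:
    """Hypothetical stand-in for a learned value: negative L1 distance to the goal."""
    return -sum(abs(s - g) for s, g in zip(state, goal))


if __name__ == "__main__":
    goal = [1.0, 4.0]
    # Far from the goal: the sparse reward is zero, but the value term still
    # gives a graded signal that improves as the agent approaches the goal.
    print(shaped_reward(0.0, [1.0, 2.0], goal, toy_value, weight=0.5))
    # At the goal: the value term vanishes and only the sparse reward remains.
    print(shaped_reward(1.0, [1.0, 4.0], goal, toy_value, weight=0.5))
```

In this sketch the value term supplies feedback on states leading to the goal, which the sparse reward alone does not provide. The weight controls how strongly that guidance influences the online agent.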