Scaling 4D Representations

  • 2024-12-19 18:59:51
  • João Carreira, Dilara Gokay, Michael King, Chuhan Zhang, Ignacio Rocco, Aravindh Mahendran, Thomas Albert Keck, Joseph Heyward, Skanda Koppula, Etienne Pot, Goker Erdogan, Yana Hasson, Yi Yang, Klaus Greff, Guillaume Le Moing, Sjoerd van Steenkiste, Daniel Zoran, Drew A. Hudson, Pedro Vélez, Luisa Polanía, Luke Friedman, Chris Duvarney, Ross Goroshin, Kelsey Allen, Jacob Walker, Rishabh Kabra, Eric Aboussouan, Jennifer Sun, Thomas Kipf, Carl Doersch, Viorica Pătrăucean, Dima Damen, Pauline Luc, Mehdi S. M. Sajjadi, Andrew Zisserman

Abstract

Scaling has not yet been convincingly demonstrated for pure self-supervised learning from video. However, prior work has focused evaluations on semantic-related tasks – action classification, ImageNet classification, etc. In this paper we focus on evaluating self-supervised learning on non-semantic vision tasks that are more spatial (3D) and temporal (+1D = 4D), such as camera pose estimation, point and object tracking, and depth estimation. We show that by learning from very large video datasets, masked auto-encoding (MAE) with transformer video models actually scales, consistently improving performance on these 4D tasks, as model size increases from 20M all the way to the largest by far reported self-supervised video model – 22B parameters. Rigorous apples-to-apples comparison with many recent image and video models demonstrates the benefits of scaling 4D representations.
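
For readers unfamiliar with masked auto-encoding on video, the core data-side step can be sketched roughly as follows: a clip is split into space-time patches (tokens), a large random subset is hidden, the encoder sees only the visible tokens, and the model is trained to reconstruct the hidden ones. This is a minimal illustration only, not the authors' pipeline; the function names (`num_tokens`, `random_mask`), the clip size, the patch size, and the 95% mask ratio are all assumptions chosen for the example.

```python
import random


def num_tokens(frames, height, width, patch_t, patch_h, patch_w):
    """Count the space-time patches (tokens) a clip is divided into."""
    # Each dimension must divide evenly into patches in this simplified sketch.
    assert frames % patch_t == 0
    assert height % patch_h == 0
    assert width % patch_w == 0
    return (frames // patch_t) * (height // patch_h) * (width // patch_w)


def random_mask(n_tokens, mask_ratio, seed=0):
    """Split token indices into visible (encoder input) and masked (targets)."""
    rng = random.Random(seed)
    order = list(range(n_tokens))
    rng.shuffle(order)  # uniform random permutation of token indices
    n_masked = int(round(n_tokens * mask_ratio))
    masked = sorted(order[:n_masked])
    visible = sorted(order[n_masked:])
    return visible, masked


# Hypothetical clip: 16 frames at 224x224, patches of 2x16x16 (assumed values).
n = num_tokens(16, 224, 224, patch_t=2, patch_h=16, patch_w=16)
# Assumed mask ratio of 0.95; the encoder would process only `visible`.
visible, masked = random_mask(n, mask_ratio=0.95)
print(n, len(visible), len(masked))
```

Because the encoder processes only the visible tokens, a high mask ratio keeps the per-clip encoder cost low, which is one commonly cited reason MAE-style objectives are attractive when scaling model size.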