Easi3R: Estimating Disentangled Motion from DUSt3R Without Training

Abstract

Recent advances in DUSt3R have enabled robust estimation of dense point clouds and camera parameters of static scenes, leveraging Transformer network architectures and direct supervision on large-scale 3D datasets. In contrast, the limited scale and diversity of available 4D datasets present a major bottleneck for training a highly generalizable 4D model. This constraint has driven conventional 4D methods to fine-tune 3D models on scalable dynamic video data with additional geometric priors such as optical flow and depth. In this work, we take the opposite path and introduce Easi3R, a simple yet efficient training-free method for 4D reconstruction. Our approach applies attention adaptation during inference, eliminating the need for from-scratch pre-training or network fine-tuning. We find that the attention layers in DUSt3R inherently encode rich information about camera and object motion. By carefully disentangling these attention maps, we achieve accurate dynamic region segmentation, camera pose estimation, and 4D dense point map reconstruction. Extensive experiments on real-world dynamic videos demonstrate that our lightweight attention adaptation significantly outperforms previous state-of-the-art methods that are trained or fine-tuned on extensive dynamic datasets. Our code is publicly available for research purposes at https://easi3r.github.io/