DUNE: Distilling a Universal Encoder from Heterogeneous 2D and 3D Teachers

Abstract

Recent multi-teacher distillation methods have unified the encoders of multiple foundation models into a single encoder, achieving competitive performance on core vision tasks like classification, segmentation, and depth estimation. This led us to ask: Could similar success be achieved when the pool of teachers also includes vision models specialized in diverse tasks across both 2D and 3D perception? In this paper, we define and investigate the problem of heterogeneous teacher distillation, or co-distillation, a challenging multi-teacher distillation scenario where teacher models vary significantly in both (a) their design objectives and (b) the data they were trained on. We explore data-sharing strategies and teacher-specific encoding, and introduce DUNE, a single encoder excelling in 2D vision, 3D understanding, and 3D human perception. Our model achieves performance comparable to that of its larger teachers, sometimes even outperforming them, on their respective tasks. Notably, DUNE surpasses MASt3R in Map-free Visual Relocalization with a much smaller encoder.