Structured 3D Latents for Scalable and Versatile 3D Generation

Abstract

We introduce a novel 3D generation method for versatile and high-quality 3D asset creation. The cornerstone is a unified Structured LATent (SLAT) representation which allows decoding to different output formats, such as Radiance Fields, 3D Gaussians, and meshes. This is achieved by integrating a sparsely-populated 3D grid with dense multiview visual features extracted from a powerful vision foundation model, comprehensively capturing both structural (geometry) and textural (appearance) information while maintaining flexibility during decoding. We employ rectified flow transformers tailored for SLAT as our 3D generation models and train models with up to 2 billion parameters on a large 3D asset dataset of 500K diverse objects. Our model generates high-quality results with text or image conditions, significantly surpassing existing methods, including recent ones at similar scales. We showcase flexible output format selection and local 3D editing capabilities which were not offered by previous models. Code, model, and data will be released.
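
As a rough illustration of the two ideas named above, the following minimal sketch stores a latent only at the active voxels of a sparse 3D grid and forms a rectified-flow training pair by linear interpolation between data and noise. It is not the released implementation: the function names, the 8x8x8 grid, the 8 latent channels, the random placeholder latents (standing in for encoded visual features), and the time convention (t = 0 is data, t = 1 is noise) are all assumptions made for this example.

```python
import numpy as np

def make_structured_latent(occupancy, channels, rng):
    """Return (coords, feats) for the active voxels of a boolean occupancy grid.

    Hypothetical helper: the latents here are random placeholders, not
    features produced by a trained encoder.
    """
    coords = np.argwhere(occupancy)                            # (L, 3) voxel indices
    feats = rng.standard_normal((coords.shape[0], channels))   # (L, C) local latents
    return coords, feats

def rectified_flow_pair(x0, noise, t):
    """Point on the straight path from data x0 to noise, plus its velocity.

    Assumed convention: t = 0 gives the data, t = 1 gives pure noise.
    """
    x_t = (1.0 - t) * x0 + t * noise
    velocity = noise - x0        # constant along the straight path
    return x_t, velocity

rng = np.random.default_rng(0)
occupancy = np.zeros((8, 8, 8), dtype=bool)
occupancy[2:4, 2:4, 2:4] = True                 # a 2x2x2 block of active voxels

coords, feats = make_structured_latent(occupancy, channels=8, rng=rng)
noise = rng.standard_normal(feats.shape)
x_t, velocity = rectified_flow_pair(feats, noise, t=0.25)
```

In this sketch a generative model would be trained to regress `velocity` from `x_t`, `t`, and the conditioning signal; the voxel coordinates in `coords` stay fixed while only the per-voxel latents are noised.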