Abstract
This work addresses the problem of novel view synthesis in diverse scenes from small collections of RGB images. We propose ERUPT (Efficient Rendering with Unposed Patch Transformer), a state-of-the-art scene reconstruction model capable of efficient scene rendering using unposed imagery. We introduce patch-based querying, in contrast to existing pixel-based queries, to reduce the compute required to render a target view. This makes our model highly efficient both during training and at inference, capable of rendering at 600 fps on commercial hardware. Notably, our model is designed to use a learned latent camera pose, which allows for training using unposed targets in datasets with sparse or inaccurate ground truth camera pose. We show that our approach can generalize on large real-world data and introduce a new benchmark dataset (MSVS-1M) for latent view synthesis using street-view imagery collected from Mapillary. In contrast to NeRF and Gaussian Splatting, which require dense imagery and precise metadata, ERUPT can render novel views of arbitrary scenes with as few as five unposed input images. ERUPT achieves better rendered image quality than current state-of-the-art methods for unposed image synthesis tasks, reduces labeled data requirements by ~95\%, and decreases computational requirements by an order of magnitude, providing efficient novel view synthesis for diverse real-world scenes.
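
To make the contrast between pixel-based and patch-based querying concrete, the sketch below counts how many decoder queries one target view would need under each scheme. It is only an illustration: the function name `query_counts`, the 128x128 resolution, and the patch size of 8 are assumptions chosen for the example and are not taken from this work.

```python
def query_counts(height: int, width: int, patch: int) -> tuple[int, int]:
    """Return (pixel_queries, patch_queries) for rendering one target view.

    Hypothetical helper: a pixel-based renderer issues one query per pixel,
    while a patch-based renderer issues one query per non-overlapping
    patch of size patch x patch.
    """
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    pixel_queries = height * width
    patch_queries = (height // patch) * (width // patch)
    return pixel_queries, patch_queries


# Assumed example values (not from the paper): 128x128 target, 8x8 patches.
pixels, patches = query_counts(128, 128, 8)
reduction = pixels // patches  # factor by which the number of queries shrinks
print(f"pixel queries: {pixels}, patch queries: {patches}, reduction: {reduction}x")
```

Under these assumed values the query count falls by a factor of patch squared. This is not the same quantity as the overall compute reduction reported above, since each patch query still has to decode every pixel in its patch.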