Abstract
We introduce ARPG, a novel visual autoregressive model that enables randomized parallel generation, addressing the inherent limitations of conventional raster-order approaches, which hinder inference efficiency and zero-shot generalization due to their sequential, predefined token generation order. Our key insight is that effective random-order modeling necessitates explicit guidance for determining the position of the next predicted token. To this end, we propose a novel guided decoding framework that decouples positional guidance from content representation, encoding them separately as queries and key-value pairs. By directly incorporating this guidance into the causal attention mechanism, our approach enables fully random-order training and generation, eliminating the need for bidirectional attention. Consequently, ARPG readily generalizes to zero-shot tasks such as image inpainting, outpainting, and resolution expansion. Furthermore, it supports parallel inference by concurrently processing multiple queries using a shared KV cache. On the ImageNet-1K 256 benchmark, our approach attains an FID of 1.94 with only 64 sampling steps, achieving over a 20-fold increase in throughput while reducing memory consumption by over 75% compared to representative recent autoregressive models at a similar scale.
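
The decoupling described above can be illustrated with a toy sketch. It is not the ARPG architecture: it assumes a single attention head, no learned projections, and random vectors in place of trained embeddings, and the names `attend`, `parallel_guided_step`, and `generate` are hypothetical. It only shows the general idea that each query carries the position to be predicted, keys and values carry already generated content, and several queries in one step read the same cache.

```python
import math
import random


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def parallel_guided_step(target_positions, pos_embed, kv_cache):
    """Predict several target positions in one step from a shared KV cache.

    Each query is built only from the positional embedding of the position
    to be predicted; keys and values come only from earlier tokens, so the
    step is causal without an explicit mask.
    """
    keys, values = kv_cache
    return {p: attend(pos_embed[p], keys, values) for p in target_positions}


def generate(num_tokens, tokens_per_step, dim, seed=0):
    """Toy random-order generation loop (illustrative assumption only)."""
    rng = random.Random(seed)
    pos_embed = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_tokens)]
    order = list(range(num_tokens))
    rng.shuffle(order)  # random generation order instead of raster order

    start = [rng.gauss(0, 1) for _ in range(dim)]
    keys, values = [start], [start]  # cache seeded with a start token
    outputs = {}
    for i in range(0, num_tokens, tokens_per_step):
        targets = order[i:i + tokens_per_step]
        # All targets of this step are computed before the cache is extended,
        # so they attend to exactly the same keys and values.
        step_out = parallel_guided_step(targets, pos_embed, (keys, values))
        for p in targets:
            content = step_out[p]  # stand-in for sampling and embedding a token
            outputs[p] = content
            keys.append([c + e for c, e in zip(content, pos_embed[p])])
            values.append(content)
    return order, outputs, (keys, values)


if __name__ == "__main__":
    order, outputs, (keys, values) = generate(num_tokens=16, tokens_per_step=4, dim=8)
    print("generation order:", order)
    print("cache entries:", len(keys))
```

In this sketch, 16 positions are filled in 4 steps of 4 queries each, and the cache grows only after a step completes, so the queries within a step do not depend on one another.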