When Less is Enough: Adaptive Token Reduction for Efficient Image Representation

Abstract

Vision encoders typically generate a large number of visual tokens, providing information-rich representations but significantly increasing computational demands. This raises the question of whether all generated tokens are equally valuable or whether some of them can be discarded to reduce computational costs without compromising quality. In this paper, we introduce a new method for determining feature utility based on the idea that less valuable features can be reconstructed from more valuable ones. We implement this concept by integrating an autoencoder with a Gumbel-Softmax selection mechanism that identifies and retains only the most informative visual tokens. To validate our approach, we compared the performance of the LLaVA-NeXT model when using features selected by our method against its performance when using randomly selected features. We found that on OCR-based tasks, more than 50% of the visual context can be removed with minimal performance loss, whereas randomly discarding the same proportion of features significantly degrades the model's capabilities. Furthermore, in general-domain tasks, even randomly retaining only 30% of tokens achieves performance comparable to using the full set of visual tokens. Our results highlight a promising direction towards adaptive and efficient multimodal pruning that facilitates scalable and low-overhead inference without compromising performance.
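
As a rough illustration of the Gumbel-Softmax selection idea mentioned above, the following minimal sketch draws a relaxed keep/drop decision for each token from a per-token "keep" logit. It is not the implementation used in this work: the function names, the two-class (keep vs. drop) formulation with the drop logit fixed at 0, the temperature values, and the hard 0.5 threshold are all assumptions made for illustration, and the autoencoder that would reconstruct the dropped features is omitted entirely.

```python
import math
import random


def gumbel_noise(rng):
    """Draw one sample from the standard Gumbel(0, 1) distribution."""
    u = rng.random()
    # Clamp away from 0 and 1 so that both logarithms stay finite.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def gumbel_softmax_keep_weights(keep_logits, temperature=1.0, rng=None):
    """Return a relaxed 'keep' weight in [0, 1] for each token.

    Hypothetical formulation: each token has two classes, keep
    (logit = keep_logits[i]) and drop (logit = 0). Independent Gumbel
    noise is added to both before a temperature-scaled softmax.
    """
    if rng is None:
        rng = random.Random()
    weights = []
    for logit in keep_logits:
        keep = (logit + gumbel_noise(rng)) / temperature
        drop = gumbel_noise(rng) / temperature
        # Subtract the maximum before exponentiating for numerical stability.
        m = max(keep, drop)
        e_keep = math.exp(keep - m)
        e_drop = math.exp(drop - m)
        weights.append(e_keep / (e_keep + e_drop))
    return weights


def select_tokens(tokens, keep_logits, temperature=0.5, rng=None):
    """Keep the tokens whose relaxed keep weight exceeds 0.5 (assumed rule)."""
    weights = gumbel_softmax_keep_weights(keep_logits, temperature, rng)
    mask = [w > 0.5 for w in weights]
    kept = [t for t, keep in zip(tokens, mask) if keep]
    return kept, mask


if __name__ == "__main__":
    tokens = ["tok0", "tok1", "tok2", "tok3"]
    logits = [2.0, -1.0, 0.5, -3.0]  # illustrative values only
    kept, mask = select_tokens(tokens, logits, rng=random.Random(0))
    print(kept, mask)
```

In a trained system the logits would come from a learned scoring module and the relaxed weights would stay differentiable during training. Lower temperatures push the weights towards 0 or 1, which brings the relaxed decision closer to a discrete selection.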