Abstract
Vision Language Models (VLMs) have led to major improvements in multimodal reasoning, yet they still struggle to understand user-specific concepts. Existing personalization methods address this limitation but rely heavily on training procedures that can be either costly or unpleasant for individual users. We depart from existing work and, for the first time, explore the training-free setting in the context of personalization. We propose a novel method, Retrieval and Reasoning for Personalization (R2P), which leverages the internal knowledge of VLMs. First, we use VLMs to extract the concept fingerprint, i.e., the key attributes that uniquely define the concept within its semantic class. When a query arrives, the most similar fingerprints are retrieved and scored via chain-of-thought reasoning. To reduce the risk of hallucinations, the scores are validated through cross-modal verification at the attribute level: in case of a discrepancy between the scores, R2P refines the concept association via pairwise multimodal matching, where the retrieved fingerprints and their images are directly compared with the query. We validate R2P on two publicly available benchmarks and on a newly introduced dataset, Personal Concepts with Visual Ambiguity (PerVA), a concept identification dataset that highlights the challenges of visual ambiguity. R2P consistently outperforms state-of-the-art approaches on various downstream tasks across all benchmarks. Code will be made available upon acceptance.
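
The control flow described above (retrieve, score, verify, fall back to pairwise matching) can be illustrated with a minimal sketch. This is not the R2P implementation: every name here (`Fingerprint`, `retrieve`, `identify`, the three scoring callables, the top-`k` size, and the discrepancy tolerance `tol`) is a hypothetical placeholder, and the VLM calls are replaced by plain Python callables so that the example runs on its own. Cosine similarity over embeddings is likewise an assumed choice of retrieval metric.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass
class Fingerprint:
    """Hypothetical container for a concept's key attributes."""
    name: str
    attributes: List[str]
    embedding: List[float]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query_emb: Sequence[float], database: Sequence[Fingerprint],
             k: int) -> List[Fingerprint]:
    """Return the k fingerprints most similar to the query embedding."""
    ranked = sorted(database, key=lambda fp: cosine(query_emb, fp.embedding),
                    reverse=True)
    return ranked[:k]


def identify(query_emb: Sequence[float],
             database: Sequence[Fingerprint],
             reasoning_score: Callable[[Fingerprint], float],
             verification_score: Callable[[Fingerprint], float],
             pairwise_match: Callable[[List[Fingerprint]], Fingerprint],
             k: int = 3,
             tol: float = 0.2) -> Optional[str]:
    """Assumed control flow: retrieve, score, verify, then fall back."""
    candidates = retrieve(query_emb, database, k)
    if not candidates:
        return None
    # Stand-in for chain-of-thought scoring of each retrieved fingerprint.
    best = max(candidates, key=reasoning_score)
    # Stand-in for cross-modal verification: accept only if both scores agree.
    if abs(reasoning_score(best) - verification_score(best)) <= tol:
        return best.name
    # Discrepancy: defer to a direct pairwise comparison of the candidates.
    return pairwise_match(candidates).name


# Toy data: two visually similar mugs and an unrelated concept.
db = [
    Fingerprint("mug_a", ["red handle"], [1.0, 0.0]),
    Fingerprint("mug_b", ["blue rim"], [0.9, 0.1]),
    Fingerprint("dog", ["brown fur"], [0.0, 1.0]),
]
query = [1.0, 0.05]
reason = {"mug_a": 0.9, "mug_b": 0.4}
agree = {"mug_a": 0.85, "mug_b": 0.5}
disagree = {"mug_a": 0.2, "mug_b": 0.5}

accepted = identify(query, db, lambda f: reason[f.name],
                    lambda f: agree[f.name], lambda c: c[-1], k=2)
refined = identify(query, db, lambda f: reason[f.name],
                   lambda f: disagree[f.name], lambda c: c[-1], k=2)
```

In the toy run, `accepted` keeps the top-scored candidate because the two scores agree within `tol`, while `refined` is decided by the pairwise fallback. A fixed threshold is only one way to define a "discrepancy"; the abstract does not specify the criterion.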