Abstract
Multimodal incremental learning needs to digest information from multiple modalities while concurrently learning new knowledge without forgetting previously learned information. This task poses numerous challenges, chiefly the larger storage size of multimodal data in exemplar-based methods and the computational cost of fine-tuning large multimodal models. In this paper, we leverage a parameter-efficient tuning scheme to reduce the burden of fine-tuning and propose the exemplar masking framework to efficiently replay old knowledge. Specifically, non-important tokens are masked based on the attention weights and the correlation across different modalities, which significantly reduces the storage size of each exemplar and consequently allows more exemplars to be stored in the same memory buffer. Moreover, we design a multimodal data augmentation technique to diversify exemplars for replaying prior knowledge. In experiments, we not only evaluate our method on existing multimodal datasets but also extend the ImageNet-R dataset to a multimodal dataset as a real-world application, where captions are generated by querying multimodal large language models (e.g., InstructBLIP). Extensive experiments show that our exemplar masking framework is more efficient and more robust to catastrophic forgetting under the same limited memory buffer. Code is available at https://github.com/YiLunLee/Exemplar_Masking_MCIL.
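To make the masking idea concrete, the following is a minimal sketch, not the released implementation. It covers only the attention-weight criterion and omits the cross-modal correlation term. The function names `mask_exemplar` and `exemplars_in_buffer`, the `keep_ratio` parameter, and the choice to measure the memory buffer in tokens are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def mask_exemplar(
    tokens: Sequence[object],
    attention: Sequence[float],
    keep_ratio: float,
) -> Tuple[List[int], List[object]]:
    """Keep the highest-attention tokens; all other tokens count as masked.

    Hypothetical helper: returns the kept positions and the kept tokens,
    which is all that would need to be stored for the exemplar.
    """
    if len(tokens) != len(attention):
        raise ValueError("tokens and attention must have the same length")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    num_keep = max(1, int(len(tokens) * keep_ratio))
    # Rank token positions by attention weight, highest first.
    ranked = sorted(range(len(tokens)), key=lambda i: attention[i], reverse=True)
    # Restore the original order so stored positions stay meaningful at replay.
    kept = sorted(ranked[:num_keep])
    return kept, [tokens[i] for i in kept]


def exemplars_in_buffer(
    buffer_tokens: int, tokens_per_exemplar: int, keep_ratio: float
) -> int:
    """Number of masked exemplars that fit in a buffer measured in tokens (assumed unit)."""
    kept_per_exemplar = max(1, int(tokens_per_exemplar * keep_ratio))
    return buffer_tokens // kept_per_exemplar


# Toy example: eight patch tokens with made-up attention weights.
patches = [f"p{i}" for i in range(8)]
weights = [0.05, 0.30, 0.02, 0.20, 0.03, 0.25, 0.10, 0.05]
kept_idx, kept_tokens = mask_exemplar(patches, weights, keep_ratio=0.5)
print(kept_idx, kept_tokens)
```

Under these assumptions, a buffer sized for 20 unmasked exemplars of 196 tokens each (3920 tokens) would hold 80 exemplars when only a quarter of the tokens are kept, which illustrates why masking lets more exemplars fit in the same memory.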