Abstract
As robotic technologies advance towards more complex multimodal interactions and manipulation tasks, the integration of advanced Vision-Language Models (VLMs) has become a key driver in the field. Despite progress with current methods, challenges persist in fusing depth and RGB information within 3D environments and in executing tasks guided by linguistic instructions. In response to these challenges, we enhance the existing RoboFlamingo framework by introducing RoboFlamingo-Plus, which incorporates depth data into VLMs to significantly improve robotic manipulation performance. Our approach fuses RGB and depth information by integrating a pre-trained Vision Transformer (ViT) with a resampling technique, and closely aligns the combined features with linguistic cues for stronger multimodal understanding. The novelty of RoboFlamingo-Plus lies in its adaptation of inputs for depth data processing, its use of a pre-trained resampler for depth feature extraction, and its use of cross-attention mechanisms for feature integration. These improvements allow RoboFlamingo-Plus not only to understand 3D environments more deeply but also to perform complex, language-guided tasks in challenging settings. Experimental results show that RoboFlamingo-Plus improves robotic manipulation performance by 10-20% over current methods, marking a significant advancement. Code and model weights are publicly available at RoboFlamingo-Plus.
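As a rough illustration of the cross-attention fusion mentioned above, the following minimal sketch lets a set of RGB tokens attend to a set of depth tokens and adds the result back as a residual. It is not the RoboFlamingo-Plus implementation: the function names (`cross_attention`, `fuse_rgb_depth`), the choice of RGB tokens as queries, the residual connection, and the omission of learned projections and multiple heads are all assumptions made for brevity.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    # Single-head scaled dot-product attention without learned projections
    # (an assumption; a real model would apply learned Q/K/V projections).
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return out

def fuse_rgb_depth(rgb_tokens, depth_tokens):
    # Hypothetical fusion step: RGB tokens query the depth tokens,
    # and the attended depth features are added as a residual.
    attended = cross_attention(rgb_tokens, depth_tokens, depth_tokens)
    return [[r + a for r, a in zip(rt, at)] for rt, at in zip(rgb_tokens, attended)]

# Toy inputs: 3 RGB tokens and 2 depth tokens, each of dimension 2.
rgb_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
depth_tokens = [[0.5, 0.5], [2.0, 0.0]]
fused = fuse_rgb_depth(rgb_tokens, depth_tokens)
```

The fused output keeps one token per RGB token, so under these assumptions it could be passed on to the language model in place of the RGB-only features.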