Abstract
Embodied decision-making is fundamental for AI agents operating in real-world environments. While Visual Language Models (VLMs) have advanced this capability, they still struggle with complex decisions, particularly in human-centered situations that require deep reasoning about human needs and values. In this study, we systematically evaluate open-sourced VLMs on multimodal human-centered decision-making tasks. We find that LLMs receiving only textual descriptions unexpectedly outperform their VLM counterparts of similar scale that process actual images, suggesting that visual alignment may hinder VLM abilities. To address this challenge, we propose a novel text-only training approach with synthesized textual data. This method strengthens VLMs' language components and transfers the learned abilities to multimodal inference, eliminating the need for expensive image-text paired data. Furthermore, we show that VLMs can achieve substantial performance gains through self-improvement, using training data generated by their LLM counterparts rather than relying on larger teacher models like GPT-4. Our findings establish a more efficient and scalable approach to enhancing VLMs' human-centered decision-making capabilities, opening new avenues for optimizing VLMs through self-improvement mechanisms.