Bridging Writing Manner Gap in Visual Instruction Tuning by Creating LLM-aligned Instructions

Abstract

In the realm of Large Multi-modal Models (LMMs), the instruction quality during the visual instruction tuning stage significantly influences the performance of modality alignment. In this paper, we assess the instruction quality from a unique perspective termed Writing Manner, which encompasses the selection of vocabulary, grammar and sentence structure to convey specific semantics. We argue that there exists a substantial writing manner gap between the visual instructions and the base Large Language Models (LLMs) within LMMs. This gap forces the pre-trained base LLMs to deviate from their original writing styles, leading to capability degradation of both base LLMs and LMMs. To bridge the writing manner gap while preserving the original semantics, we propose directly leveraging the base LLM to align the writing manner of soft-format visual instructions with that of the base LLM itself, resulting in novel LLM-aligned instructions. The manual writing manner evaluation results demonstrate that our approach successfully minimizes the writing manner gap. By utilizing LLM-aligned instructions, the baseline models LLaVA-7B and QwenVL demonstrate enhanced resistance to hallucinations and non-trivial comprehensive improvements across all 15 visual and language benchmarks.