Abstract
We propose SPHINX-X, an extensive Multimodality Large Language Model (MLLM) series developed upon SPHINX. To improve the architecture and training efficiency, we modify the SPHINX framework by removing redundant visual encoders, bypassing fully-padded sub-images with skip tokens, and simplifying multi-stage training into a one-stage all-in-one paradigm. To fully unleash the potential of MLLMs, we assemble a comprehensive multi-domain and multimodal dataset covering publicly available resources in language, vision, and vision-language tasks. We further enrich this collection with our curated OCR-intensive and Set-of-Mark datasets, extending its diversity and generality. By training over different base LLMs, including TinyLlama1.1B, InternLM2-7B, LLaMA2-13B, and Mixtral8x7B, we obtain a spectrum of MLLMs that vary in parameter size and multilingual capabilities. Comprehensive benchmarking reveals a strong correlation between multimodal performance and the data and parameter scales. Code and models are released at https://github.com/Alpha-VLLM/LLaMA2-Accessory.