Abstract
The unprecedented advancements in Large Language Models (LLMs) have profoundly impacted natural language processing but have yet to fully embrace the realm of scalable vector graphics (SVG) generation. While LLMs encode partial knowledge of SVG data from web pages during training, recent findings suggest that semantically ambiguous and tokenized representations within LLMs may result in hallucinations in vector primitive predictions. Additionally, LLM training typically lacks modeling and understanding of the rendering sequence of vector paths, which can lead to occlusion between output vector primitives. In this paper, we present LLM4SVG, an initial yet substantial step toward bridging this gap by enabling LLMs to better understand and generate vector graphics. LLM4SVG facilitates a deeper understanding of SVG components through learnable semantic tokens, which precisely encode these components and their corresponding properties to generate semantically aligned SVG outputs. Using a series of learnable semantic tokens, a structured instruction-following dataset is developed to support comprehension and generation across two primary tasks. Our method introduces a modular architecture to existing large language models, integrating semantic tags, vector instruction encoders, fine-tuned commands, and powerful LLMs to tightly combine geometric, appearance, and language information. To overcome the scarcity of SVG-text instruction data, we developed an automated data generation pipeline that collected our SVGX-SFT Dataset, consisting of high-quality human-designed SVGs and 580k SVG instruction-following samples specifically crafted for LLM training, which facilitated the adoption of the supervised fine-tuning strategy popular in LLM development.