Think2SQL: Reinforce LLM Reasoning Capabilities for Text2SQL

Abstract

Large Language Models (LLMs) have shown impressive capabilities in transforming natural language questions about relational databases into SQL queries. Despite recent improvements, small LLMs struggle to handle questions involving multiple tables and complex SQL patterns under a Zero-Shot Learning (ZSL) setting. Supervised Fine-Tuning (SFT) partially compensates for the knowledge deficits in pretrained models but falls short when dealing with queries involving multi-hop reasoning. To bridge this gap, different LLM training strategies to reinforce reasoning capabilities have been proposed, ranging from leveraging a thinking process within ZSL, to including reasoning traces in SFT, to adopting Reinforcement Learning (RL) strategies. However, the influence of reasoning on Text2SQL performance is still largely unexplored. This paper investigates to what extent LLM reasoning capabilities influence their Text2SQL performance on four benchmark datasets. To this end, it considers the following LLM settings: (1) ZSL, with or without general-purpose reasoning; (2) SFT, with and without task-specific reasoning traces; (3) RL, leveraging execution accuracy as the primary reward function; (4) SFT+RL, i.e., a two-stage approach that combines SFT and RL. The results show that general-purpose reasoning under ZSL proves to be ineffective in tackling complex Text2SQL cases. Small LLMs benefit from SFT with reasoning much more than larger ones, bridging the gap of their (weaker) model pretraining. RL is generally beneficial across all tested models and datasets, particularly when SQL queries involve multi-hop reasoning and multiple tables. Small LLMs with SFT+RL excel on most complex datasets thanks to a strategic balance between generality of the reasoning process and optimization of the execution accuracy. Thanks to RL, the 7B Qwen-Coder-2.5 model performs on par with models of 100+ billion parameters on the Bird dataset.
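
To make the notion of execution accuracy as a reward concrete, the following is a minimal sketch, not the paper's implementation. It assumes a SQLite database, a binary reward, and order-insensitive comparison of result rows; the function names (`run_query`, `execution_accuracy_reward`) and the toy schema are hypothetical and chosen only for illustration.

```python
import sqlite3
from collections import Counter


def run_query(conn, sql):
    """Execute sql and return its rows as a multiset, or None if it fails."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    # A multiset ignores row order but still counts duplicate rows.
    return Counter(rows)


def execution_accuracy_reward(conn, predicted_sql, gold_sql):
    """Hypothetical binary reward: 1.0 if both queries return the same rows."""
    gold = run_query(conn, gold_sql)
    pred = run_query(conn, predicted_sql)
    if gold is None or pred is None:
        return 0.0
    return 1.0 if pred == gold else 0.0


# Toy two-table database (an assumption for this example only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE author(id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE paper(id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
INSERT INTO author VALUES (1, 'Ada'), (2, 'Bob');
INSERT INTO paper VALUES (1, 1, 'A'), (2, 1, 'B'), (3, 2, 'C');
""")

gold = ("SELECT a.name, COUNT(*) FROM author a "
        "JOIN paper p ON p.author_id = a.id GROUP BY a.name")
# Different SQL text, same result: rewarded.
equivalent = ("SELECT name, (SELECT COUNT(*) FROM paper "
              "WHERE author_id = author.id) FROM author")
# Misses the join, so the counts are wrong: not rewarded.
wrong = "SELECT name, COUNT(*) FROM author GROUP BY name"
# Syntactically invalid: not rewarded.
broken = "SELEC name FROM author"

reward_equivalent = execution_accuracy_reward(conn, equivalent, gold)
reward_wrong = execution_accuracy_reward(conn, wrong, gold)
reward_broken = execution_accuracy_reward(conn, broken, gold)
```

Comparing executed results rather than SQL strings means that syntactically different but semantically equivalent queries receive the same reward, which is why a signal of this kind can be useful for multi-table queries that admit several correct formulations.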