Abstract
Tool-integrated reasoning (TIR) augments large language models (LLMs) with the ability to invoke external tools, such as search engines and code interpreters, to solve tasks beyond the capabilities of language-only reasoning. While reinforcement learning (RL) has shown promise in improving TIR by optimizing final answer correctness, existing approaches often overlook the efficiency and cost associated with tool usage. This can lead to suboptimal behavior, including excessive tool calls that increase computational and financial overhead, or insufficient tool use that compromises answer quality. In this work, we propose Optimal Tool Call-controlled Policy Optimization (OTC-PO), a simple yet effective RL-based framework that encourages models to produce accurate answers with minimal tool calls. Our method introduces a tool-integrated reward that jointly considers correctness and tool efficiency, promoting high tool productivity. We instantiate this framework within both Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO), resulting in OTC-PPO and OTC-GRPO. Experiments with Qwen-2.5 and Qwen-Math across multiple QA benchmarks show that our approach reduces tool calls by up to 73.1\% and improves tool productivity by up to 229.4\%, while maintaining comparable answer accuracy. To the best of our knowledge, this is the first RL-based framework that explicitly optimizes tool-use efficiency in TIR.
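
The abstract does not give the form of the tool-integrated reward or the exact definition of tool productivity, so the following is only a minimal sketch of the general idea under stated assumptions. The `Rollout` type, the `penalty` parameter, the reciprocal scaling in `efficiency_scaled_reward`, and the "correct answers per tool call" reading of `tool_productivity` are all hypothetical choices made for illustration and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Rollout:
    """One sampled trajectory (hypothetical container, not from the paper)."""
    correct: bool    # whether the final answer matched the reference
    tool_calls: int  # number of tool calls made along the trajectory


def efficiency_scaled_reward(rollout: Rollout, penalty: float = 0.5) -> float:
    """Illustrative reward: a correctness reward shrunk by each tool call.

    Assumption: an incorrect answer earns 0 regardless of tool usage, and a
    correct answer earns 1 / (1 + penalty * tool_calls). This is a stand-in
    for the paper's reward, whose exact form is not stated in the abstract.
    """
    if not rollout.correct:
        return 0.0
    return 1.0 / (1.0 + penalty * rollout.tool_calls)


def tool_productivity(rollouts: Sequence[Rollout]) -> float:
    """Assumed metric: number of correct answers per tool call over a batch.

    The denominator is clamped to 1 so that a batch with no tool calls does
    not divide by zero (a design choice of this sketch).
    """
    n_correct = sum(1 for r in rollouts if r.correct)
    total_calls = sum(r.tool_calls for r in rollouts)
    return n_correct / max(total_calls, 1)


if __name__ == "__main__":
    batch = [Rollout(True, 1), Rollout(True, 3), Rollout(False, 4)]
    for r in batch:
        print(r, efficiency_scaled_reward(r))
    print("tool productivity:", tool_productivity(batch))
```

Under these assumptions, two correct trajectories receive different rewards when one uses fewer tool calls, which is the behavior the abstract describes: accuracy is preserved as the primary signal while unnecessary calls are discouraged.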