Abstract
As large language models (LLMs) have shown great success in many tasks, they are used in various applications. While many works have focused on the efficiency of single-LLM applications (e.g., offloading, request scheduling, parallelism strategy selection), multi-LLM applications have received less attention, particularly in offline inference scenarios. In this work, we aim to improve the offline end-to-end inference efficiency of multi-LLM applications in a single-node multi-GPU environment. The problem involves two key decisions: (1) determining which LLMs to run concurrently at any given time (we may not run all the models at the same time), and (2) selecting a parallelism strategy for each LLM. This problem is NP-hard. Naive solutions may not work well because the running time for a model to complete a set of requests depends on the request workload and the selected parallelism strategy, and they lack an accurate model of the running time. As the LLM output lengths are unknown before running, to estimate the model running time, we propose a sampling-then-simulation method which first estimates the output lengths by sampling from an empirical cumulative distribution function that we obtained from a large dataset in advance, and then simulates the LLM inference process accordingly. Based on the simulation, we estimate the per-iteration latencies to get the total latency. A greedy method is proposed to optimize the scheduling of the LLMs in the application across the GPUs. We then propose a framework, SamuLLM, which contains two phases: planning, which calls the greedy method for an application, and running, which runs the application and dynamically adjusts the model scheduling based on runtime information. Experiments on 3 applications and a mixed application show that SamuLLM can achieve 1.0-2.4$\times$ end-to-end speedups compared to the competitors.
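The sampling-then-simulation idea can be illustrated with a minimal sketch. This is not the SamuLLM implementation: the function names, the first-come-first-served iteration-level batching policy, and the per-iteration latency function `iter_latency` are all hypothetical assumptions made for illustration. The sketch draws output lengths by inverting an empirical distribution built from previously observed lengths, then steps through a simplified decoding loop and sums an assumed per-iteration latency.

```python
import random


def build_ecdf(observed_lengths):
    """Sort observed output lengths so the empirical CDF can be inverted by index."""
    return sorted(observed_lengths)


def sample_output_length(sorted_lengths, rng):
    """Inverse-transform sampling: map a uniform draw to an observed length."""
    u = rng.random()  # uniform in [0, 1)
    idx = min(int(u * len(sorted_lengths)), len(sorted_lengths) - 1)
    return sorted_lengths[idx]


def simulate_total_latency(output_lengths, max_batch_size, iter_latency):
    """Simulate a simplified decoding loop (assumed policy, not the paper's).

    Each iteration generates one token for every running request. Finished
    requests leave the batch and waiting requests are admitted in order.
    `iter_latency(batch_size)` is a hypothetical cost model for one iteration.
    Output lengths are assumed to be at least 1.
    """
    pending = list(output_lengths)  # tokens still to generate, per waiting request
    running = []                    # tokens still to generate, per running request
    total, iterations = 0.0, 0
    while pending or running:
        # Admit waiting requests until the batch is full.
        while pending and len(running) < max_batch_size:
            running.append(pending.pop(0))
        total += iter_latency(len(running))
        iterations += 1
        # Every running request emits one token; drop those that are done.
        running = [r - 1 for r in running if r > 1]
    return total, iterations


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical observed output lengths standing in for a large dataset.
    ecdf = build_ecdf([12, 40, 7, 128, 64, 33, 90, 15])
    sampled = [sample_output_length(ecdf, rng) for _ in range(16)]
    # Hypothetical linear cost model: fixed overhead plus a per-request term.
    latency, iters = simulate_total_latency(
        sampled, max_batch_size=4, iter_latency=lambda b: 0.010 + 0.002 * b
    )
    print(f"estimated latency: {latency:.3f}s over {iters} iterations")
```

In a planner, an estimate of this kind would be computed for each candidate parallelism strategy, with the cost model fitted per strategy, so that a greedy scheduler can compare candidates without running the models.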