Abstract
Large language model (LLM)-based applications consist of both LLM and non-LLM components, each contributing to the end-to-end latency. Despite great efforts to optimize LLM inference, end-to-end workflow optimization has been overlooked. Existing frameworks employ coarse-grained orchestration with task modules, which confines optimizations to within each module and yields suboptimal scheduling decisions. We propose fine-grained end-to-end orchestration, which utilizes task primitives as the basic units and represents each query's workflow as a primitive-level dataflow graph. This explicitly exposes a much larger design space, enables optimizations in parallelization and pipelining across primitives of different modules, and enhances scheduling to improve application-level performance. We build Teola, a novel orchestration framework for LLM-based applications that implements this scheme. Comprehensive experiments show that Teola can achieve up to 2.09x speedup over existing systems across various popular LLM applications. The code is available at https://github.com/NetX-lab/Ayo.