Abstract
Retrieval-augmented generation (RAG), which combines large language models (LLMs) with retrievals from external knowledge databases, is emerging as a popular approach for reliable LLM serving. However, efficient RAG serving remains an open challenge due to the rapid emergence of many RAG variants and the substantial differences in workload characteristics across them. In this paper, we make three fundamental contributions to advancing RAG serving. First, we introduce RAGSchema, a structured abstraction that captures the wide range of RAG algorithms, serving as a foundation for performance optimization. Second, we analyze several representative RAG workloads with distinct RAGSchema, revealing significant performance variability across these workloads. Third, to address this variability and meet diverse performance requirements, we propose RAGO (Retrieval-Augmented Generation Optimizer), a system optimization framework for efficient RAG serving. Our evaluation shows that RAGO achieves up to a 2x increase in QPS per chip and a 55% reduction in time-to-first-token latency compared to RAG systems built on LLM-system extensions.