Shared Disk KV Cache Management for Efficient Multi-Instance Inference in RAG-Powered LLMs

Abstract

Recent large language models (LLMs) face increasing inference latency as input context length and model size continue to grow. In particular, the retrieval-augmented generation (RAG) technique, which enhances LLM responses by incorporating external knowledge, exacerbates this issue by significantly increasing the number of input tokens. This expansion in token length leads to a substantial rise in computational overhead, particularly during the prefill stage, resulting in prolonged time-to-first-token (TTFT). To address this issue, this paper proposes a method to reduce TTFT by leveraging a disk-based key-value (KV) cache to lessen the computational burden during the prefill stage. We also introduce a disk-based shared KV cache management system, called Shared RAG-DCache, for multi-instance LLM RAG service environments. This system, together with an optimal system configuration, improves both throughput and latency under given resource constraints. Shared RAG-DCache exploits the locality of documents related to user queries in RAG, as well as the queueing delay in LLM inference services. It proactively generates and stores disk KV caches for query-related documents and shares them across multiple LLM instances to enhance inference performance. In experiments on a single host equipped with 2 GPUs and 1 CPU, Shared RAG-DCache achieved a 15-71% increase in throughput and a 12-65% reduction in latency, depending on the resource configuration.
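
The core idea of a document-keyed KV cache on disk, shared by several LLM instances, can be illustrated with a minimal sketch. The code below is not the Shared RAG-DCache implementation; the class name SharedDiskKVCache, the SHA-256 keying, the pickle file layout, and the prefill callback are all assumptions made for illustration. It shows only the lookup-or-compute pattern: an instance reuses a stored KV entry for a retrieved document if one exists and otherwise runs prefill once and publishes the result for the other instances.

```python
import hashlib
import os
import pickle
import tempfile
from pathlib import Path
from typing import Any, Callable, Optional


class SharedDiskKVCache:
    """Hypothetical disk-backed KV store keyed by a hash of the document text.

    Several processes (LLM instances) can point at the same directory.
    """

    def __init__(self, root: str) -> None:
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, document: str) -> Path:
        # Assumption: the document text itself identifies the cache entry.
        digest = hashlib.sha256(document.encode("utf-8")).hexdigest()
        return self.root / f"{digest}.kv"

    def get(self, document: str) -> Optional[Any]:
        """Return the stored KV object, or None on a cache miss."""
        try:
            with open(self._path(document), "rb") as f:
                return pickle.load(f)
        except FileNotFoundError:
            return None

    def put(self, document: str, kv: Any) -> None:
        """Write to a temp file, then rename, so readers never see a partial entry."""
        fd, tmp = tempfile.mkstemp(dir=self.root, suffix=".tmp")
        with os.fdopen(fd, "wb") as f:
            pickle.dump(kv, f)
        os.replace(tmp, self._path(document))

    def get_or_compute(self, document: str, prefill: Callable[[str], Any]) -> Any:
        """Reuse a stored entry if present; otherwise run prefill and store it."""
        kv = self.get(document)
        if kv is None:
            kv = prefill(document)  # stands in for the expensive prefill pass
            self.put(document, kv)
        return kv
```

In this sketch a proactive generator could call get_or_compute for documents related to a queued query while the query waits, so that the serving instance later finds the entry already on disk. Real KV tensors would need a tensor-aware serialization format rather than pickle.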