SQuat: Subspace-orthogonal KV Cache Quantization

Abstract

The key-value (KV) cache accelerates LLM decoding by storing KV tensors from previously generated tokens. It reduces redundant computation at the cost of increased memory usage. To mitigate this overhead, existing approaches compress KV tensors into lower-bit representations; however, quantization errors can accumulate as more tokens are generated, potentially resulting in undesired outputs. In this paper, we introduce SQuat (Subspace-orthogonal KV cache quantization). It first constructs a subspace spanned by query tensors to capture the most critical task-related information. During key tensor quantization, it enforces that the difference between the (de)quantized and original keys remains orthogonal to this subspace, minimizing the impact of quantization errors on the attention mechanism's outputs. SQuat requires no model fine-tuning, no additional calibration dataset for offline learning, and is grounded in a theoretical framework we develop. Through numerical experiments, we show that our method reduces peak memory by 2.17× to 2.82×, improves throughput by 2.45× to 3.60×, and achieves more favorable benchmark scores than existing KV cache quantization algorithms.
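
To make the orthogonality idea concrete, the following NumPy sketch shows why an error that is orthogonal to the query subspace leaves attention logits unchanged. It is an illustration under stated assumptions, not the SQuat algorithm itself: the helper names (`uniform_quantize`, `query_subspace_basis`), the 2-bit min-max quantizer, the SVD-based subspace, and the synthetic low-rank queries are all hypothetical choices made here. In particular, the adjusted key below is not restricted to the quantization grid, whereas a real method must satisfy both constraints at once.

```python
import numpy as np

def uniform_quantize(x, num_bits=2):
    """Round x onto a uniform min-max grid with 2**num_bits levels, then dequantize."""
    levels = 2 ** num_bits - 1
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / levels if hi > lo else 1.0
    return np.round((x - lo) / scale) * scale + lo

def query_subspace_basis(queries, rank):
    """Orthonormal basis (d x rank) for the top-`rank` right singular directions."""
    _, _, vt = np.linalg.svd(queries, full_matrices=False)
    return vt[:rank].T

rng = np.random.default_rng(0)
d, n_queries, rank = 16, 32, 4

# Assumption: the queries lie exactly in a rank-4 subspace of R^16.
queries = rng.standard_normal((n_queries, rank)) @ rng.standard_normal((rank, d))
key = rng.standard_normal(d)

basis = query_subspace_basis(queries, rank)

# Naive quantization: the error generally has a component inside the query subspace.
key_hat = uniform_quantize(key)
error = key_hat - key

# Remove the in-subspace component so the remaining error is orthogonal to every query.
error_orth = error - basis @ (basis.T @ error)
key_adj = key + error_orth  # illustrative only: not on the quantization grid

# Largest change in any attention logit q . k caused by each perturbation.
logit_err_naive = np.abs(queries @ key_hat - queries @ key).max()
logit_err_orth = np.abs(queries @ key_adj - queries @ key).max()

print(f"max logit error, naive quantization: {logit_err_naive:.3e}")
print(f"max logit error, orthogonal error:   {logit_err_orth:.3e}")
```

With the orthogonal error, every product between a query and the key matches its unquantized value up to floating-point rounding, even though the key itself has been perturbed. This is the property the abstract appeals to when it says the approach minimizes the impact of quantization errors on the attention outputs.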