The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs

Abstract

Sparse attention offers a promising strategy to extend long-context capabilities in Transformer LLMs, yet its viability, its efficiency-accuracy trade-offs, and systematic scaling studies remain unexplored. To address this gap, we perform a careful comparison of training-free sparse attention methods at varying model scales, sequence lengths, and sparsity levels on a diverse collection of long-sequence tasks, including novel ones that rely on natural language while remaining controllable and easy to evaluate. Based on our experiments, we report a series of key findings: 1) An isoFLOPS analysis reveals that for very long sequences, larger and highly sparse models are preferable to smaller and dense ones. 2) The level of sparsity attainable while statistically guaranteeing accuracy preservation is higher during decoding than prefilling, and correlates with model size in the former. 3) There is no clear strategy that performs best across tasks and phases, with different units of sparsification or budget adaptivity needed for different scenarios. Even moderate sparsity levels often result in significant performance degradation on at least one task, highlighting that sparse attention is not a universal solution. 4) We introduce and validate novel scaling laws specifically tailored for sparse attention, providing evidence that our findings are likely to hold true beyond our range of experiments. Through these insights, we demonstrate that sparse attention is a key tool to enhance the capabilities of Transformer LLMs for processing longer sequences, but requires careful evaluation of trade-offs for performance-sensitive applications.