Abstract
Large language models (LLMs) have proven to be highly effective across various natural language processing tasks. However, their large number of parameters poses significant challenges for practical deployment. Pruning, a technique aimed at reducing the size and complexity of LLMs, offers a potential solution by removing redundant components from the network. Despite the promise of pruning, existing methods often struggle to achieve substantial end-to-end LLM inference speedup. In this paper, we introduce SLEB, a novel approach designed to streamline LLMs by eliminating redundant transformer blocks. We choose the transformer block as the fundamental unit for pruning because LLMs exhibit block-level redundancy, with high similarity between the outputs of neighboring blocks. This choice allows us to effectively enhance the processing speed of LLMs. Our experimental results demonstrate that SLEB outperforms previous LLM pruning methods in accelerating LLM inference while also maintaining superior perplexity and accuracy, making SLEB a promising technique for enhancing the efficiency of LLMs. The code is available at: https://github.com/jiwonsong-dev/SLEB.
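The block-level redundancy mentioned above can be illustrated with a minimal sketch that measures the cosine similarity between the outputs of neighboring blocks. Everything here is an assumption made for illustration: the toy residual blocks, the random weights, and the helper names `cosine_similarity`, `neighbor_similarities`, and `run_toy_blocks` are hypothetical and are not taken from the SLEB code base, and the abstract does not specify the criterion SLEB actually uses to select blocks for removal.

```python
import numpy as np


def cosine_similarity(a, b):
    """Mean cosine similarity between matching rows (tokens) of a and b."""
    num = np.sum(a * b, axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    return float(np.mean(num / den))


def neighbor_similarities(block_outputs):
    """Similarity between the outputs of each pair of adjacent blocks."""
    return [
        cosine_similarity(block_outputs[i], block_outputs[i + 1])
        for i in range(len(block_outputs) - 1)
    ]


def run_toy_blocks(x, weights):
    """Apply toy residual blocks x <- x + tanh(x @ W), recording each output.

    This is a stand-in for real transformer blocks (an assumption); it only
    mimics the residual structure that keeps neighboring outputs close.
    """
    outputs = []
    for w in weights:
        x = x + np.tanh(x @ w)
        outputs.append(x)
    return outputs


rng = np.random.default_rng(0)
hidden = 64
x = rng.standard_normal((16, hidden))  # 16 hypothetical token embeddings
weights = [0.01 * rng.standard_normal((hidden, hidden)) for _ in range(6)]

outputs = run_toy_blocks(x, weights)
sims = neighbor_similarities(outputs)
for i, s in enumerate(sims):
    print(f"blocks {i} -> {i + 1}: cosine similarity = {s:.4f}")
```

In this toy setting each block adds only a small update to the residual stream, so adjacent outputs are nearly parallel. With a real model, the same measurement would be applied to the hidden states recorded after each transformer block on calibration text.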