Transformers without Normalization

Abstract

Normalization layers are ubiquitous in modern neural networks and have long been considered essential. This work demonstrates that Transformers without normalization can achieve the same or better performance using a remarkably simple technique. We introduce Dynamic Tanh (DyT), an element-wise operation $\mathrm{DyT}(x) = \tanh(\alpha x)$, as a drop-in replacement for normalization layers in Transformers. DyT is inspired by the observation that layer normalization in Transformers often produces tanh-like, $S$-shaped input-output mappings. By incorporating DyT, Transformers without normalization can match or exceed the performance of their normalized counterparts, mostly without hyperparameter tuning. We validate the effectiveness of Transformers with DyT across diverse settings, ranging from recognition to generation, supervised to self-supervised learning, and computer vision to language models. These findings challenge the conventional understanding that normalization layers are indispensable in modern neural networks, and offer new insights into their role in deep networks.
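
As a minimal illustration of the element-wise operation described above, the following plain-Python sketch applies $\tanh(\alpha x)$ to a small vector and, for comparison, a standard layer normalization without affine parameters. The function names `dyt` and `layer_norm`, the fixed value `alpha = 0.5`, and the sample activations are assumptions chosen for illustration; they are not taken from the paper.

```python
import math


def dyt(xs, alpha=0.5):
    """Apply tanh(alpha * x) to each element independently.

    alpha is a plain scalar here (an illustrative assumption).
    """
    return [math.tanh(alpha * x) for x in xs]


def layer_norm(xs, eps=1e-5):
    """Standard layer normalization over one vector, without affine parameters."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return [(x - mean) / math.sqrt(var + eps) for x in xs]


# Hypothetical activations containing two large-magnitude values.
activations = [-8.0, -1.0, 0.0, 1.0, 8.0]

squashed = dyt(activations, alpha=0.5)   # each output lies strictly in (-1, 1)
normalized = layer_norm(activations)     # zero mean, roughly unit variance

for x, s, n in zip(activations, squashed, normalized):
    print(f"x={x:+.1f}  dyt={s:+.4f}  layer_norm={n:+.4f}")
```

Unlike layer normalization, `dyt` computes no statistics across the vector: each output depends only on its own input and on `alpha`, while large-magnitude inputs are squashed toward the range boundaries in an S-shaped fashion.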