Abstract
Attention mechanisms that confer selective focus on a strict subset of input elements are nearly ubiquitous in language models today. We posit that the use of attention has a downside: most input information is lost. In support of this idea we observe poor input representation accuracy in transformers and more accurate representation in what we term masked mixers, which replace self-attention with masked convolutions. The masked mixer learns causal language modeling more efficiently than early transformer implementations and even outperforms optimized, current transformers when training on small ($n_{ctx}<512$) but not larger context windows. Evidence is presented for the hypothesis that differences in transformer and masked mixer training efficiencies for various tasks are best predicted by input representation accuracy, or equivalently global invertibility. We hypothesize that the information loss exhibited by transformers would be more detrimental to retrieval than to generation, as the former is more closely approximated by a bijective and thus invertible function. We find that masked mixers are more effective retrieval models both when the pretrained embedding model is unchanged and when the embedding model is modified via cosine similarity-based InfoNCE loss minimization. A small masked mixer is shown to outperform a large and near state-of-the-art transformer-based retrieval model, despite the latter being trained with many orders of magnitude more data and compute.
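
To make the phrase "masked convolutions" concrete, the following is a minimal sketch of causal token mixing, not the implementation used in this work. It assumes the mixing step can be viewed as a learned position-by-position weight matrix whose entries above the diagonal are masked out, so that the output at position t depends only on inputs at positions s <= t. The function name `causal_mix` and the plain-list data layout are hypothetical and chosen only for illustration.

```python
def causal_mix(weights, tokens):
    """Mix token embeddings across positions under a lower-triangular mask.

    weights: n x n list of floats (hypothetical learned mixing weights).
    tokens:  n x d list of floats (one d-dimensional embedding per position).
    Returns an n x d list where row t is sum over s <= t of weights[t][s] * tokens[s].
    """
    n = len(tokens)
    d = len(tokens[0])
    mixed = []
    for t in range(n):
        row = [0.0] * d
        # The causal mask: only positions s <= t contribute, so any
        # weights[t][s] with s > t is ignored (equivalent to zeroing it).
        for s in range(t + 1):
            w = weights[t][s]
            for k in range(d):
                row[k] += w * tokens[s][k]
        mixed.append(row)
    return mixed


if __name__ == "__main__":
    # Entries above the diagonal are deliberately large to show they are masked.
    w = [[1.0, 9.0, 9.0],
         [0.5, 0.5, 9.0],
         [0.2, 0.3, 0.5]]
    x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print(causal_mix(w, x))
```

Under this assumption, changing a later token leaves the mixed outputs at all earlier positions unchanged, which is the property required for causal language modeling. Unlike attention, the mixing weights here do not depend on the input.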