
SwitchHead: Accelerating Transformers With Mixture-of-Experts Attention

Róbert Csordás, Piotr Piękos, Kazuki Irie, Jürgen Schmidhuber. No Venue, 2023

[Paper]
Model Architecture

The costly self-attention layers in modern Transformers require memory and compute quadratic in sequence length. Existing approximation methods usually underperform and fail to obtain significant speedups in practice. Here we present SwitchHead - a novel method that reduces both compute and memory requirements and achieves wall-clock speedup, while matching the language modeling performance of baseline Transformers with the same parameter budget. SwitchHead uses Mixture-of-Experts (MoE) layers for the value and output projections and requires 4 to 8 times fewer attention matrices than standard Transformers. Our novel attention can also be combined with MoE MLP layers, resulting in an efficient fully-MoE “SwitchAll” Transformer model. Our code is public.
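To make the idea concrete, below is a minimal PyTorch sketch of an MoE attention layer in the spirit of SwitchHead: it computes only a few attention matrices (dense query/key projections) while the value and output projections are each mixtures of experts selected per head and per token. The class name `MoEAttentionSketch`, the sigmoid top-k gating, the expert counts, and the dimensions are illustrative assumptions, not the authors' exact implementation; see their public code for the real method.

```python
# Illustrative sketch only; gating scheme and sizes are assumptions.
import torch
import torch.nn as nn


class MoEAttentionSketch(nn.Module):
    def __init__(self, d_model, n_heads=2, n_experts=4, k=2, d_head=64):
        super().__init__()
        self.n_heads, self.n_experts, self.k, self.d_head = n_heads, n_experts, k, d_head
        # Dense query/key projections: only n_heads attention matrices are computed,
        # far fewer than a standard Transformer with the same parameter budget.
        self.q = nn.Linear(d_model, n_heads * d_head, bias=False)
        self.kproj = nn.Linear(d_model, n_heads * d_head, bias=False)
        # Per-head expert banks for the value and output projections.
        self.v_experts = nn.Parameter(torch.randn(n_heads, n_experts, d_model, d_head) * d_model ** -0.5)
        self.o_experts = nn.Parameter(torch.randn(n_heads, n_experts, d_head, d_model) * d_head ** -0.5)
        # Gates that choose which experts to use, conditioned on the layer input.
        self.v_gate = nn.Linear(d_model, n_heads * n_experts, bias=False)
        self.o_gate = nn.Linear(d_model, n_heads * n_experts, bias=False)

    def _gate(self, x, gate):
        # Sigmoid scores with per-head top-k expert selection (an assumed gating choice).
        B, T, _ = x.shape
        scores = torch.sigmoid(gate(x)).view(B, T, self.n_heads, self.n_experts)
        topv, topi = scores.topk(self.k, dim=-1)
        return torch.zeros_like(scores).scatter_(-1, topi, topv)  # (B, T, H, E)

    def forward(self, x):
        B, T, _ = x.shape
        H, Dh = self.n_heads, self.d_head
        q = self.q(x).view(B, T, H, Dh).transpose(1, 2)      # (B, H, T, Dh)
        k = self.kproj(x).view(B, T, H, Dh).transpose(1, 2)  # (B, H, T, Dh)

        # Value projection as a per-head mixture of experts.
        # For clarity all experts are evaluated densely here; an efficient
        # implementation would only compute the selected experts.
        wv = self._gate(x, self.v_gate)                                   # (B, T, H, E)
        v_all = torch.einsum('btd,hedf->bthef', x, self.v_experts)        # (B, T, H, E, Dh)
        v = (v_all * wv.unsqueeze(-1)).sum(3).transpose(1, 2)             # (B, H, T, Dh)

        # Standard scaled dot-product attention (no causal mask, for brevity).
        att = torch.softmax(q @ k.transpose(-2, -1) / Dh ** 0.5, dim=-1)  # (B, H, T, T)
        ctx = (att @ v).transpose(1, 2)                                   # (B, T, H, Dh)

        # Output projection is also a mixture of experts, gated on the layer input.
        wo = self._gate(x, self.o_gate)                                   # (B, T, H, E)
        o_all = torch.einsum('bthf,hefd->bthed', ctx, self.o_experts)     # (B, T, H, E, d_model)
        return (o_all * wo.unsqueeze(-1)).sum(3).sum(2)                   # (B, T, d_model)


# Example usage with assumed sizes:
layer = MoEAttentionSketch(d_model=512)
y = layer(torch.randn(2, 16, 512))  # -> shape (2, 16, 512)
```

The point of the design is that the expensive part, the attention matrices themselves, shrinks with the number of heads, while expressivity is recovered by letting each head pick among several value/output experts per token.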

https://huggingface.co/discussions/paper/657a68821ccc3c2a5ea66870

Similar Work