
Multi-scale Self-attention For Text Classification

Qipeng Guo, Xipeng Qiu, Pengfei Liu, Xiangyang Xue, Zheng Zhang. Proceedings of the AAAI Conference on Artificial Intelligence, 2020 – 59 citations

Tags: AAAI, Datasets, Model Architecture

In this paper, we introduce prior knowledge of multi-scale structure into self-attention modules. We propose a Multi-Scale Transformer, which uses multi-scale multi-head self-attention to capture features at different scales. Based on linguistic considerations and an analysis of a pre-trained Transformer (BERT) on a large corpus, we further design a strategy to control the scale distribution at each layer. Results on three kinds of tasks (21 datasets) show that our Multi-Scale Transformer consistently and significantly outperforms the standard Transformer on small and moderate-size datasets.
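To make the idea concrete, below is a minimal sketch of multi-scale multi-head self-attention in which each head is restricted to a head-specific local window ("scale"), with one head left unrestricted for global context. The class name, the particular window sizes, and the masking scheme are illustrative assumptions for exposition, not the paper's exact implementation or its learned scale distribution.

```python
import torch
import torch.nn.functional as F
from torch import nn


class MultiScaleSelfAttention(nn.Module):
    """Sketch: each head attends within a head-specific window (scale).

    A scale of None means unrestricted (global) attention.
    Window sizes here are hypothetical, not taken from the paper.
    """

    def __init__(self, d_model, scales=(1, 3, 5, None)):
        super().__init__()
        self.n_heads = len(scales)
        assert d_model % self.n_heads == 0
        self.d_head = d_model // self.n_heads
        self.scales = scales
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):                          # x: (batch, seq_len, d_model)
        b, n, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def split(t):                              # -> (batch, heads, seq_len, d_head)
            return t.view(b, n, self.n_heads, self.d_head).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)

        pos = torch.arange(n, device=x.device)
        rel = (pos[None, :] - pos[:, None]).abs()  # |i - j| token distances

        outputs = []
        for h, scale in enumerate(self.scales):
            scores = q[:, h] @ k[:, h].transpose(-2, -1) / self.d_head ** 0.5
            if scale is not None:
                # mask out positions farther than `scale` tokens away
                scores = scores.masked_fill(rel > scale, float("-inf"))
            attn = F.softmax(scores, dim=-1)
            outputs.append(attn @ v[:, h])         # (batch, seq_len, d_head)

        return self.out(torch.cat(outputs, dim=-1))
```

Usage: `MultiScaleSelfAttention(d_model=256)(torch.randn(2, 32, 256))` returns a tensor of the same shape. Varying the `scales` tuple per layer is one simple way to mimic the paper's idea of controlling the scale distribution across layers.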

Similar Work