
Well-read Students Learn Better: On The Importance Of Pre-training Compact Models

Iulia Turc, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. arXiv 2019 – 439 citations

[Paper]
Efficiency Fine Tuning Model Architecture Training Techniques

Recent developments in natural language representations have been accompanied by large and expensive models that leverage vast amounts of general-domain text through self-supervised pre-training. Due to the cost of applying such models to downstream tasks, several model compression techniques for pre-trained language representations have been proposed (Sun et al., 2019; Sanh, 2019). Surprisingly, however, the simple baseline of just pre-training and fine-tuning compact models has been overlooked. In this paper, we first show that pre-training remains important in the context of smaller architectures, and that fine-tuning pre-trained compact models can be competitive with more elaborate methods proposed in concurrent work. Starting from pre-trained compact models, we then explore transferring task knowledge from large fine-tuned models through standard knowledge distillation. The resulting simple, yet effective and general algorithm, Pre-trained Distillation, brings further improvements. Through extensive experiments, we more generally explore the interaction between pre-training and distillation under two variables that have been under-studied: model size and the properties of unlabeled task data. One surprising observation is that they have a compound effect even when applied sequentially on the same data. To accelerate future research, we will make our 24 pre-trained miniature BERT models publicly available.
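The distillation step the abstract refers to is the standard knowledge-distillation objective: the student is trained to match the teacher's temperature-softened output distribution via cross-entropy. A minimal sketch of that loss is below; the model architectures, training loop, and data pipeline from the paper are omitted, and the function names and toy logits here are illustrative, not from the paper's code.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the teacher's softened distribution and the student's.

    In Pre-trained Distillation, the compact student is itself pre-trained
    before this step, and the loss is computed on (possibly unlabeled) task data.
    """
    p_teacher = softmax(teacher_logits, temperature)
    log_p_student = np.log(softmax(student_logits, temperature) + 1e-12)
    return -np.sum(p_teacher * log_p_student, axis=-1).mean()

# Toy example: one two-class instance.
teacher = np.array([[4.0, 1.0]])   # confident teacher
student = np.array([[0.5, 0.3]])   # near-uniform student
loss = distillation_loss(student, teacher, temperature=2.0)
```

A higher temperature flattens the teacher distribution, exposing the relative probabilities it assigns to non-argmax classes ("dark knowledge"), which is what makes distillation more informative than training on hard labels alone.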

Similar Work