
News Summarization And Evaluation In The Era Of GPT-3

Tanya Goyal, Junyi Jessy Li, Greg Durrett. arXiv 2022 – 173 citations

Topics: Compositional Generalization, Datasets, Evaluation, Fine-Tuning, Interdisciplinary Approaches, Model Architecture, Multimodal, Semantic Representation, Prompting

The recent success of prompting large language models like GPT-3 has led to a paradigm shift in NLP research. In this paper, we study its impact on text summarization, focusing on the classic benchmark domain of news summarization. First, we investigate how GPT-3 compares against fine-tuned models trained on large summarization datasets. We show that not only do humans overwhelmingly prefer GPT-3 summaries, prompted using only a task description, but these summaries also do not suffer from common dataset-specific issues such as poor factuality. Next, we study what this means for evaluation, particularly the role of gold standard test sets. Our experiments show that neither reference-based nor reference-free automatic metrics can reliably evaluate GPT-3 summaries. Finally, we evaluate models in a setting beyond generic summarization, specifically keyword-based summarization, and show how dominant fine-tuning approaches compare to prompting. To support further research, we release: (a) a corpus of 10K generated summaries from fine-tuned and prompt-based models across 4 standard summarization benchmarks, and (b) 1K human preference judgments comparing different systems for generic- and keyword-based summarization.
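As a concrete illustration of prompting "using only a task description", the sketch below shows generic and keyword-based summarization prompts. The paper studied GPT-3 (text-davinci-002); the client library, model name, and exact prompt wording here are illustrative assumptions, not the authors' configuration.

```python
# Minimal sketch of task-description-only prompting for news summarization.
# Assumptions: the modern `openai` Python client (pip install openai) and the
# completions-style model "gpt-3.5-turbo-instruct"; the paper itself used
# GPT-3 (text-davinci-002). Prompt wording is hypothetical.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def summarize(article: str, model: str = "gpt-3.5-turbo-instruct") -> str:
    # Generic summarization: the prompt contains only the article and a
    # one-line task description -- no in-context examples, no fine-tuning.
    prompt = f"Article: {article}\n\nSummarize the above article in three sentences."
    response = client.completions.create(
        model=model,
        prompt=prompt,
        max_tokens=160,
        temperature=0.0,  # near-deterministic output, convenient for evaluation
    )
    return response.choices[0].text.strip()


def keyword_summarize(article: str, keyword: str,
                      model: str = "gpt-3.5-turbo-instruct") -> str:
    # Keyword-based summarization: the task description additionally asks the
    # model to focus on a user-specified keyword.
    prompt = (f"Article: {article}\n\n"
              f'Summarize the above article, focusing on "{keyword}", '
              f"in two sentences.")
    response = client.completions.create(
        model=model, prompt=prompt, max_tokens=120, temperature=0.0,
    )
    return response.choices[0].text.strip()
```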
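To make the evaluation finding concrete, the sketch below computes a reference-based metric (ROUGE) by comparing a system summary against a gold reference. It uses the standard `rouge-score` package as an assumed stand-in for the paper's evaluation stack, not the authors' exact code.

```python
# Reference-based evaluation sketch: ROUGE scores a system summary against a
# human-written "gold" reference. Requires `pip install rouge-score`.
# The example summaries below are invented for illustration.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

gold = "The city council approved the new transit budget on Tuesday."
system = "On Tuesday, the council passed the transit budget for next year."

scores = scorer.score(gold, system)  # signature: score(target, prediction)
for name, s in scores.items():
    print(f"{name}: P={s.precision:.3f} R={s.recall:.3f} F1={s.fmeasure:.3f}")
```

Such reference-based scores reward overlap with dataset-specific gold summaries, so a fluent, factual summary written in a different style can still score poorly, which is consistent with the paper's finding that these metrics fail to reliably evaluate GPT-3 summaries.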

Similar Work