
TwHIN-BERT: A Socially-Enriched Pre-trained Language Model for Multilingual Tweet Representations at Twitter

Xinyang Zhang, Yury Malkov, Omar Florez, Serim Park, Brian McWilliams, Jiawei Han, Ahmed El-Kishky. KDD '23: The 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2023 – 44 citations

[Paper] · Search on Google Scholar · Search on Semantic Scholar
Applications · Compositional Generalization · Datasets · Evaluation · Image Text Integration · Interdisciplinary Approaches · KDD · Model Architecture · Multimodal Semantic Representation · Training Techniques

Pre-trained language models (PLMs) are fundamental to natural language processing applications. Most existing PLMs are not tailored to the noisy, user-generated text found on social media, and their pre-training does not factor in the valuable social engagement logs available in a social network. We present TwHIN-BERT, a multilingual language model productionized at Twitter and trained on in-domain data from the popular social network. TwHIN-BERT differs from prior pre-trained language models in that it is trained not only with text-based self-supervision but also with a social objective based on the rich social engagements within a Twitter heterogeneous information network (TwHIN). Our model is trained on 7 billion tweets covering over 100 distinct languages, providing a valuable representation for modeling short, noisy, user-generated text. We evaluate our model on a variety of multilingual social recommendation and semantic understanding tasks and demonstrate significant metric improvements over established pre-trained language models. We open-source TwHIN-BERT and our curated hashtag prediction and social engagement benchmark datasets to the research community.
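The abstract describes two complementary training signals: standard text-based self-supervision (masked language modeling) and a social objective derived from engagement links in TwHIN. The sketch below illustrates, at a high level, how such a joint objective could be wired together in PyTorch. The base checkpoint, batch construction, temperature, and loss weighting are illustrative assumptions, not the authors' released configuration.

```python
# Hypothetical sketch of a TwHIN-BERT-style joint objective: a masked-language-
# modelling (MLM) loss on tweet text plus an InfoNCE-style contrastive loss that
# pulls together tweets co-engaged by similar users in the social graph.
# The multilingual BERT checkpoint, pooling, temperature, and alpha weighting
# below are assumptions for illustration only.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")


def mlm_loss(masked_batch):
    """Standard MLM cross-entropy on randomly masked tweet tokens."""
    out = model(
        input_ids=masked_batch["input_ids"],
        attention_mask=masked_batch["attention_mask"],
        labels=masked_batch["labels"],
    )
    return out.loss


def social_contrastive_loss(anchor_batch, positive_batch, temperature=0.05):
    """Contrastive loss: socially co-engaged tweet pairs are positives,
    and the other tweets in the batch serve as in-batch negatives."""

    def embed(batch):
        hidden = model.base_model(
            input_ids=batch["input_ids"],
            attention_mask=batch["attention_mask"],
        ).last_hidden_state
        # Mean-pool over non-padding tokens to get one vector per tweet.
        mask = batch["attention_mask"].unsqueeze(-1).float()
        return (hidden * mask).sum(1) / mask.sum(1)

    a = F.normalize(embed(anchor_batch), dim=-1)
    p = F.normalize(embed(positive_batch), dim=-1)
    logits = a @ p.T / temperature                       # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)   # diagonal = true positives
    return F.cross_entropy(logits, targets)


def training_step(masked_batch, anchor_batch, positive_batch, alpha=1.0):
    """Combine the text and social objectives; alpha is an assumed weighting."""
    return mlm_loss(masked_batch) + alpha * social_contrastive_loss(
        anchor_batch, positive_batch
    )
```

The intuition behind the social term is that tweets engaged with by similar users should receive similar embeddings, which is what makes the representations useful for the social recommendation tasks mentioned above; the open-sourced TwHIN-BERT checkpoints themselves are distributed as standard `transformers`-compatible models.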

Similar Work