Exploring Phoneme-level Speech Representations For End-to-end Speech Translation

Elizabeth Salesky, Matthias Sperber, Alan W Black. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019 – 44 citations

ACL Interdisciplinary Approaches Training Techniques

Previous work on end-to-end translation from speech has primarily used frame-level features as speech representations, which creates longer, sparser sequences than text. We show that a naive method to create compressed phoneme-like speech representations is far more effective and efficient for translation than traditional frame-level speech features. Specifically, we generate phoneme labels for speech frames and average consecutive frames with the same label to create shorter, higher-level source sequences for translation. We see improvements of up to 5 BLEU on both our high- and low-resource language pairs, with a reduction in training time of 60%. Our improvements hold across multiple data sizes and two language pairs.
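
The compression step described in the abstract (label each frame with a phoneme, then average each run of consecutive frames that share a label) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `average_by_phoneme`, the list-of-lists feature layout, and the toy labels are assumptions made for illustration, and the frame-level phoneme labels are taken as given rather than produced by a recognizer.

```python
from itertools import groupby
from typing import List, Sequence, Tuple


def average_by_phoneme(
    frames: Sequence[Sequence[float]],
    labels: Sequence[str],
) -> List[Tuple[str, List[float]]]:
    """Collapse each run of consecutive same-label frames into its mean vector.

    Hypothetical helper: `frames` holds one feature vector per speech frame,
    and `labels` holds one (assumed precomputed) phoneme label per frame.
    """
    if len(frames) != len(labels):
        raise ValueError("frames and labels must have the same length")

    segments: List[Tuple[str, List[float]]] = []
    start = 0
    # groupby only merges adjacent equal labels, so a phoneme that recurs
    # later in the utterance produces a separate segment.
    for label, run in groupby(labels):
        run_length = len(list(run))
        chunk = frames[start:start + run_length]
        dim = len(chunk[0])
        mean = [sum(frame[d] for frame in chunk) / run_length for d in range(dim)]
        segments.append((label, mean))
        start += run_length
    return segments


# Toy example: five 2-dimensional frames with made-up labels.
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
labels = ["AH", "AH", "T", "T", "T"]
compressed = average_by_phoneme(frames, labels)
# Five frames are reduced to two segments, one per run of identical labels.
```

The output sequence has one vector per phoneme run rather than one per frame, which is the source of the shorter source sequences the abstract refers to.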
