
AlignTTS: Efficient Feed-Forward Text-to-Speech System Without Explicit Alignment

Zhen Zeng, Jianzong Wang, Ning Cheng, Tian Xia, Jing Xiao. ICASSP 2020 (IEEE International Conference on Acoustics, Speech and Signal Processing) – 60 citations

[Paper]

Targeting both high efficiency and high performance, we propose AlignTTS to predict the mel-spectrum in parallel. AlignTTS is based on a Feed-Forward Transformer that generates the mel-spectrum from a sequence of characters, with the duration of each character determined by a duration predictor. Instead of adopting the attention mechanism of Transformer TTS to align text to mel-spectrum, an alignment loss is introduced that considers all possible alignments during training using dynamic programming. Experiments on the LJSpeech dataset show that our model not only achieves state-of-the-art performance, outperforming Transformer TTS by 0.03 in mean opinion score (MOS), but also runs more than 50 times faster than real-time.
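The alignment loss described above sums over all monotonic text-to-frame alignments with a forward-algorithm-style dynamic program. The sketch below is an illustrative reading of that idea, not the paper's implementation: the function name, array shapes, and the assumption that every token covers at least one frame are choices made here, and `log_probs` stands in for the per-frame log-likelihoods that the paper's mix-density network would produce.

```python
import numpy as np

def alignment_log_likelihood(log_probs):
    """Hypothetical sketch of a dynamic-programming alignment loss.

    log_probs: [T, N] array where log_probs[t, n] is the log-likelihood of
    mel frame t under the density predicted for text token n.

    Sums (in log space) over all monotonic alignments in which tokens are
    visited in order and each token covers at least one frame.
    """
    T, N = log_probs.shape
    neg_inf = -1e30
    # alpha[t, n]: log-probability of aligning frames 0..t with tokens 0..n
    alpha = np.full((T, N), neg_inf)
    alpha[0, 0] = log_probs[0, 0]
    for t in range(1, T):
        for n in range(min(t + 1, N)):
            stay = alpha[t - 1, n]                             # frame t stays on token n
            move = alpha[t - 1, n - 1] if n > 0 else neg_inf   # frame t advances to token n
            alpha[t, n] = np.logaddexp(stay, move) + log_probs[t, n]
    # Total log-likelihood of the mel sequence under all alignments;
    # the training loss would be the negative of this value.
    return alpha[T - 1, N - 1]
```

Because the recursion only looks one frame back, the whole sum over exponentially many alignments is computed in O(T·N) time, which is what makes training without an explicit (attention-based) alignment tractable.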

Similar Work