Diffusion Synthesizer for Efficient Multilingual Speech to Speech Translation

Jun 14, 2024·
Nameer Hirschkind, Xiao Yu, Mahesh Kumar Nandwana, Joseph Liu, Eloi Du Bois, Dao Le, Nicolas Thiebaut, Colin Sinclair, Kyle Spence, Charles Shang, Zoe Abrams, Morgan McGuire
Abstract
We introduce DiffuseST, a low-latency, direct speech-to-speech translation system capable of preserving the input speaker’s voice zero-shot while translating from multiple source languages into English. We experiment with the synthesizer component of the architecture, comparing a Tacotron-based synthesizer to a novel diffusion-based synthesizer. We find the diffusion-based synthesizer to improve MOS and PESQ audio quality metrics by 23% each and speaker similarity by 5% while maintaining comparable BLEU scores. Despite having more than double the parameter count, the diffusion synthesizer has lower latency, allowing the entire model to run more than 5× faster than real-time.
Type
Publication
In Interspeech 2024