Diffusion Synthesizer for Efficient Multilingual Speech to Speech Translation

Jun 14, 2024

Nameer Hirschkind

Xiao Yu

Mahesh Kumar Nandwana

Joseph Liu

Eloi Du Bois

Dao Le

Nicolas Thiebaut

Colin Sinclair

Kyle Spence

Charles Shang

Zoe Abrams

Morgan McGuire

Abstract

We introduce DiffuseST, a low-latency, direct speech-to-speech translation system that preserves the input speaker’s voice zero-shot while translating from multiple source languages into English. We experiment with the synthesizer component of the architecture, comparing a Tacotron-based synthesizer to a novel diffusion-based synthesizer. We find that the diffusion-based synthesizer improves the MOS and PESQ audio quality metrics by 23% each and speaker similarity by 5%, while maintaining comparable BLEU scores. Despite having more than double the parameter count, the diffusion synthesizer has lower latency, allowing the entire model to run more than 5× faster than real-time.
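
As a rough illustration of what "faster than real-time" means, the sketch below computes the real-time factor (processing time divided by audio duration) and its inverse, the speedup over real-time. The function names and timing values are hypothetical and chosen only for illustration. They are not measurements from the paper.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration; values below 1 are faster than real-time."""
    if processing_seconds <= 0 or audio_seconds <= 0:
        raise ValueError("durations must be positive")
    return processing_seconds / audio_seconds


def speedup_over_real_time(processing_seconds: float, audio_seconds: float) -> float:
    """Inverse of the real-time factor: how many times faster than real-time the system runs."""
    return 1.0 / real_time_factor(processing_seconds, audio_seconds)


# Hypothetical example: 12 s of input speech translated in 2 s of compute.
rtf = real_time_factor(processing_seconds=2.0, audio_seconds=12.0)
speedup = audio_seconds_per_compute_second = 12.0 / 2.0

print(f"real-time factor: {rtf:.3f}")
print(f"speedup over real-time: {speedup:.1f}x")
```

Under these assumed numbers the system would clear the "more than 5× faster than real-time" threshold, since any speedup above 5 corresponds to a real-time factor below 0.2.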

Type

Conference paper

Publication

In Interspeech 2024

Last updated on Jun 14, 2024

Audio Speech Deep Learning

Authors

Eloi Du Bois

Principal Engineer / RnD engineer