DiFlow-TTS

Compact and Low-Latency Zero-Shot Text-to-Speech with Discrete Flow Matching

Interspeech 2026 (Long Paper Track)
Ngoc-Son Nguyen¹, Thanh V. T. Tran¹, Hieu-Nghia Huynh-Nguyen¹, Truong-Son Hy², Van Nguyen¹,†
¹ FPT Software AI Center
² University of Alabama at Birmingham
† Corresponding author

💡 Abstract

Zero-shot text-to-speech (TTS) has made significant progress in replicating unseen voices, yet balancing generation quality and inference efficiency remains challenging. Autoregressive models suffer from high latency, while diffusion-based approaches are constrained by training-time configurations. Moreover, most flow-based methods operate in continuous space, which introduces optimization challenges because continuous token spaces are inherently more complex than discrete ones. To address these limitations, we propose DiFlow-TTS, a novel zero-shot TTS framework based on discrete flow matching. The model consists of a deterministic Phoneme-Content Mapper for linguistic modeling and a Factorized Discrete Flow Denoiser that simultaneously generates prosody and acoustic token streams. On LibriSpeech test-clean, DiFlow-TTS trained on 470 hours of data reaches a UTMOS of 3.98 and a WER of 0.05 at 128 NFE, with the highest F0 accuracy (0.88) and lowest F0 RMSE (7.97) among the compared systems, while the 122M-parameter DiFlow-TTS-Small runs at an RTF of 0.03 with 4 NFE.


🔧 Method

Overview of DiFlow-TTS

Figure 1: Overview of DiFlow-TTS. A Codec Encoder decomposes the speech prompt into a speaker embedding, prosody, and acoustic tokens for zero-shot style transfer, while the Phoneme-Content Mapper converts text into content tokens and content embeddings. Conditioned on the content embeddings and the speaker identity, prosody, and acoustic tokens extracted from the speech prompt, the Factorized Discrete Flow Denoiser simultaneously generates prosody and acoustic tokens. Finally, the generated tokens together with the speaker embedding are fed into the Codec Decoder to reconstruct the waveform.

Detailed architecture of DiFlow-TTS

Figure 2: Detailed architecture of DiFlow-TTS. We formulate zero-shot TTS as token prediction over a factorized codec token space. Speech is tokenized by the (a) Speech Tokenizer into content, prosody, and acoustic tokens along with a speaker embedding. Built on these tokens, we design a framework comprising two main modules: (b) Phoneme-Content Mapper, which maps input phonemes to discrete content tokens and generates corresponding content embeddings; and (c) Factorized Discrete Flow Denoiser, which performs discrete flow matching conditioned on the speaker embedding, the discrete prosody and acoustic tokens derived from the speech prompt, and the content embeddings.
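To make the sampling side of discrete flow matching concrete, here is a minimal sketch of an Euler-style sampler that fills in two token streams (prosody and acoustic) in the same loop. It is an illustration under stated assumptions, not the DiFlow-TTS implementation: the all-mask starting state, the linear scheduler, the vocabulary sizes, and the names `toy_posterior` and `sample_tokens` are all hypothetical, and the random `toy_posterior` stands in for a trained denoiser that would be conditioned on the content embeddings and the speech prompt.

```python
import math
import random


def toy_posterior(x_t, t, vocab_size, rng):
    """Hypothetical stand-in for a trained denoiser.

    Returns one categorical distribution over clean tokens per position.
    It ignores x_t and t; a real model would use them plus its conditioning.
    """
    rows = []
    for _ in x_t:
        logits = [rng.gauss(0.0, 1.0) for _ in range(vocab_size)]
        peak = max(logits)
        weights = [math.exp(v - peak) for v in logits]
        total = sum(weights)
        rows.append([w / total for w in weights])
    return rows


def sample_tokens(length, vocab_sizes, nfe, seed=0):
    """Sample every stream with `nfe` Euler steps from an all-mask start."""
    rng = random.Random(seed)
    # Assumption: each stream starts fully masked; the mask id is vocab_size.
    streams = {name: [size] * length for name, size in vocab_sizes.items()}
    for step in range(nfe):
        t = step / nfe
        # Assumed linear scheduler kappa_t = t: the per-step jump probability
        # h / (1 - t) equals 1 / (nfe - step) and reaches 1 at the last step.
        jump_prob = 1.0 / (nfe - step)
        for name, vocab_size in vocab_sizes.items():
            x_t = streams[name]
            probs = toy_posterior(x_t, t, vocab_size, rng)
            next_x = []
            for token, p in zip(x_t, probs):
                is_masked = token == vocab_size
                if is_masked and rng.random() < jump_prob:
                    # Jump from the mask state to a sampled clean token.
                    token = rng.choices(range(vocab_size), weights=p)[0]
                next_x.append(token)
            streams[name] = next_x
    return streams


tokens = sample_tokens(length=20, vocab_sizes={"prosody": 16, "acoustic": 32}, nfe=4)
print({name: seq[:5] for name, seq in tokens.items()})
```

In this sketch the number of steps `nfe` is chosen at sampling time rather than fixed by training, and the last step always unmasks every remaining position, so the output contains no mask ids.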


🎯 Challenges

Existing zero-shot TTS paradigms each hit a different wall:

🐢 High-latency autoregressive generation

Synthesizing tokens step-by-step makes inference inherently slow.

⛓️ Diffusion ties training to sampling

Tightly coupling the training and sampling processes restricts sampling configurations to those established during training.

🌀 Continuous space complexity

Operating over continuous, high-dimensional, unbounded representations complicates density estimation and invites out-of-distribution artifacts.


✨ Quantitative Results

Performance on LibriSpeech test-clean

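| Model | Data (hrs) | UTMOS ↑ | WER ↓ | SIM-O ↑ | F0 Acc. ↑ | F0 RMSE ↓ | Energy Acc. ↑ | Energy RMSE ↓ |
|---|---|---|---|---|---|---|---|---|
| Ground Truth | - | 4.10 | 0.02 | - | - | - | - | - |
| VoiceCraft | GS (9K) | 3.55 | 0.18 | 0.51 | 0.78 | 17.22 | 0.44 | 0.010 |
| VALL-E | LT (500) | 3.68 | 0.19 | 0.40 | 0.75 | 21.66 | 0.36 | 0.020 |
| NaturalSpeech 2 | LT (585) | 2.38 | 0.09 | 0.31 | 0.80 | 15.62 | 0.25 | 0.020 |
| F5-TTS | LT (500) | 3.76 | 0.24 | 0.52 | 0.80 | 13.78 | 0.67 | 0.010 |
| F5-TTS | E (100K) | 3.72 | 0.09 | 0.66 | 0.83 | 12.66 | 0.66 | 0.010 |
| OZSpeech | LT (500) | 3.15 | 0.05 | 0.40 | 0.81 | 11.96 | 0.67 | 0.010 |
| MaskGCT | E (100K) | 3.83 | 0.09 | 0.67 | 0.77 | 14.33 | 0.75 | 0.007 |
| DiFlow-TTS (Ours) | LT (470) | 3.98 | 0.05 | 0.45 | 0.88 | 7.97 | 0.73 | 0.007 |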

Bold = best per column. Results use 3-second audio prompts; DiFlow-TTS is evaluated with 128 function evaluations (NFE) and trained on only 470 hours, less data than any baseline listed.

Model Size & Inference Latency

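| Model | #Params ↓ | NFE | RTF ↓ | UTMOS ↑ | WER ↓ | SIM-O ↑ | F0 RMSE ↓ | Energy RMSE ↓ |
|---|---|---|---|---|---|---|---|---|
| VoiceCraft | 830M | - | 1.70 | 3.55 | 0.18 | 0.51 | 17.22 | 0.010 |
| VALL-E | 594M | - | 0.86 | 3.68 | 0.19 | 0.40 | 21.66 | 0.020 |
| NaturalSpeech 2 | 378M | 200 | 1.66 | 2.38 | 0.09 | 0.31 | 15.62 | 0.020 |
| F5-TTS | 336M | 32 | 0.26 | 3.72 | 0.09 | 0.66 | 12.66 | 0.010 |
| OZSpeech | 145M | 1 | 0.03 | 3.15 | 0.05 | 0.40 | 11.96 | 0.010 |
| MaskGCT | 1.43B | 50+45* | 0.46 | 3.83 | 0.09 | 0.67 | 14.33 | 0.007 |
| DiFlow-TTS-Small (Ours) | 122M | 4 | 0.03 | 3.34 | 0.06 | 0.43 | 8.31 | 0.007 |
| | | 16 | 0.05 | 3.89 | 0.05 | 0.45 | 8.58 | 0.008 |
| DiFlow-TTS (Ours) | 164M | 4 | 0.03 | 3.31 | 0.05 | 0.44 | 8.05 | 0.007 |
| | | 16 | 0.07 | 3.86 | 0.05 | 0.45 | 7.96 | 0.007 |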

* MaskGCT is a two-stage system that first predicts masked semantic tokens, then infers masked acoustic tokens. Bold = best per column. DiFlow-TTS-Small is up to 11.7× smaller than MaskGCT and up to 34× faster than VoiceCraft at comparable or better quality.
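The two headline ratios can be reproduced from the table values with a short calculation. This is a sketch of one reading that matches the stated figures: it assumes the 34× speedup compares VoiceCraft's RTF with DiFlow-TTS-Small at NFE = 16 (RTF 0.05), and it treats RTF as the usual ratio of synthesis time to audio duration, so a ratio of RTFs is a speedup. The variable names are illustrative only.

```python
# Parameter counts in millions, taken from the #Params column (1.43B = 1430M).
params_millions = {"MaskGCT": 1430, "DiFlow-TTS-Small": 122}

# Real-time factors from the RTF column.
# Assumption: the speedup claim refers to the NFE = 16 row of DiFlow-TTS-Small.
rtf = {"VoiceCraft": 1.70, "DiFlow-TTS-Small (NFE=16)": 0.05}

size_ratio = params_millions["MaskGCT"] / params_millions["DiFlow-TTS-Small"]
speedup = rtf["VoiceCraft"] / rtf["DiFlow-TTS-Small (NFE=16)"]

print(f"size ratio: {size_ratio:.1f}x")  # prints "size ratio: 11.7x"
print(f"speedup: {speedup:.0f}x")        # prints "speedup: 34x"
```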

Best Naturalness 🏆
Best Content Accuracy 🛡️
Strong Prosody Accuracy 🥇
Data Efficiency 📊

🔊 Qualitative Results

All audio samples on this demo page were generated by DiFlow-TTS (NFE=128), trained on 470 hours of the LibriTTS dataset.

LibriSpeech test-clean

| # | Transcription | Prompt | Synthesized Speech |
|---|---|---|---|
| 1 | As soon as these dispositions were made, the scout turned to David and gave him his parting instructions. | (audio) | (audio) |
| 2 | The task will not be difficult, returned David, hesitating, though I greatly fear your presence would rather increase than mitigate his unhappy fortunes. | (audio) | (audio) |
| 3 | In both these high mythical subjects, the surrounding nature, though suffering, is still dignified and beautiful. | (audio) | (audio) |
| 4 | Keswick, March twenty second, eighteen thirty seven. Dear Madam. | (audio) | (audio) |
| 5 | The meter continued in general service during eighteen ninety nine and probably up to the close of the century. | (audio) | (audio) |
| 6 | As used in the speech of everyday life, the word carries an undertone of deprecation. | (audio) | (audio) |
| 7 | That is the best way to decide, for the spear will always point somewhere, and one thing is as good as another. | (audio) | (audio) |
| 8 | There came upon me a sudden shock when I heard these words, which exceeded anything which I had yet felt. | (audio) | (audio) |
| 9 | We are quite satisfied now, Captain Battleax, said my wife. | (audio) | (audio) |
| 10 | As he flew, his down-reaching, clutching talons were not half a yard above the fugitive's head. | (audio) | (audio) |

Celebrities

| Celebrity | Transcription | Prompt | Synthesized Speech |
|---|---|---|---|
| Elon Musk | When something is important enough, you do it even if the odds are not in your favor. | (audio) | (audio) |
| Jensen Huang | Technology has transformed the way we communicate with each other around the world. | (audio) | (audio) |
| Mark Zuckerberg | Artificial intelligence is transforming industries across the globe at an unprecedented pace. | (audio) | (audio) |
| Donald Trump | The weather outside is bright and clear, perfect for a walk in the park. | (audio) | (audio) |

⚠️ Disclaimer: The audio samples provided above are for academic purposes only and are intended to demonstrate technical capabilities.