DPI-TTS: Directional Patch Interaction for Fast-Converging and Style Temporal Modeling in Text-to-Speech

Xin Qi, Ruibo Fu, Zhengqi Wen, Tao Wang, Chunyu Qiang, Jianhua Tao, Chengxing Li, Yi Lu, Shuchen Shi, Zhiyong Wang, Xiaopeng Wang, Yuankun Xie, Yukun Liu, Xuefei Liu, Guanjun Li

Abstract

Speech diffusion models have advanced rapidly in recent years. While the widely used U-Net architecture remains prominent, transformer-based models such as DiT (Diffusion Transformer) have also attracted significant attention. However, existing DiT-based speech models often treat Mel spectrograms as generic images, overlooking the acoustic properties specific to speech. To address this limitation, we propose DPI-TTS, a method that builds a directional patch interaction approach on top of DiT. The method is designed to train quickly without sacrificing accuracy. Notably, the low-to-high frequency, frame-by-frame progressive inference approach employed by DPI-TTS aligns more closely with acoustic properties, which enhances the naturalness of the synthesized speech. Additionally, we introduce a fine-grained style temporal modeling method, which further improves the speaker style similarity of DPI-TTS. Experimental results demonstrate that our method not only accelerates training but also outperforms the baselines.

Our code is available here.

DPI-TTS Framework

The overall framework of DPI-TTS.
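
The abstract describes a low-to-high frequency, frame-by-frame progression over Mel-spectrogram patches. One way to picture such a directional interaction is as an attention mask in which each patch only sees patches from earlier frames, plus patches at the same or lower frequency within its own frame block. The sketch below is a minimal illustration under that assumption and is not the authors' implementation: the function names (`patchify`, `directional_mask`, `masked_attention`), the patch sizes, the masking rule, and the projection-free attention are all hypothetical choices made for this example.

```python
import numpy as np

def patchify(mel, patch_f, patch_t):
    """Split a mel spectrogram (n_mels, n_frames) into flattened patches.

    Assumed ordering (illustrative only): frame block by frame block, and
    from low to high frequency within each block. Row 0 of `mel` is taken
    to be the lowest frequency bin.
    """
    n_mels, n_frames = mel.shape
    assert n_mels % patch_f == 0 and n_frames % patch_t == 0
    nf, nt = n_mels // patch_f, n_frames // patch_t
    # (nf, patch_f, nt, patch_t) -> (nt, nf, patch_f, patch_t)
    blocks = mel.reshape(nf, patch_f, nt, patch_t).transpose(2, 0, 1, 3)
    return blocks.reshape(nt * nf, patch_f * patch_t)

def directional_mask(nf, nt):
    """Boolean (N, N) mask; mask[i, j] is True when patch i may attend to patch j."""
    t = np.repeat(np.arange(nt), nf)  # frame-block index of each patch
    f = np.tile(np.arange(nf), nt)    # frequency-band index of each patch
    earlier_frame = t[None, :] < t[:, None]
    same_frame_lower_freq = (t[None, :] == t[:, None]) & (f[None, :] <= f[:, None])
    return earlier_frame | same_frame_lower_freq

def masked_attention(x, mask):
    """Single-head self-attention without learned projections, restricted by mask."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    scores = np.where(mask, scores, -np.inf)
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x

# Toy example: 8 mel bins x 6 frames, patches of 4 bins x 2 frames.
mel = np.arange(8 * 6, dtype=float).reshape(8, 6)
patches = patchify(mel, patch_f=4, patch_t=2)
mask = directional_mask(nf=2, nt=3)
out = masked_attention(patches, mask)
```

With this patch ordering the mask reduces to an ordinary lower-triangular (causal) mask over the patch sequence, so the first patch (lowest band of the first frame block) attends only to itself.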

DPI-TTS results on LJSpeech dataset

Sample Speech Text
1 Mrs. De Mohrenschildt thought that Oswald,
2 The prisoner had nothing to deal with but wooden panels, and by dint of cutting and chopping he got both the lower panels out.
3 Examination of the cartridge cases found on the sixth floor of the Depository Building
4 testified that the information available to the Federal Government about Oswald before the assassination would, if known to PRS,
5 It is an easy document to understand when you remember that it was called into being
6 And in many directions, the intervention of that organized control which we call government
7 Calcraft served the city of London till eighteen seventy-four, when he was pensioned at the rate of twenty-five shillings per week.
8 we will not allow ourselves to run around in new circles of futile discussion and debate, always postponing the day of decision.
9 There has never been much science in the system of carrying out the extreme penalty in this country; the "finisher of the law"
10 he had his pockets filled with bread and cheese, and it was generally supposed that he had come a long distance to see the fatal show.

DPI-TTS results on VCTK dataset

Sample Speech Text
1 it has the potential to be another north sea.
2 while they went on holiday, we got the contract.
3 it is a tough game but we have a chance.
4 i learned a lot from her.
5 people tended to stay there for some time.
6 she is currently the permanent secretary at the department of transport.
7 he will never walk the streets again.
8 so, where do we go from here?
9 it is great to have this beautiful new site in wonderful countryside.
10 one policeman was killed.