Efficient Speech Language Modeling via Energy Distance in Continuous Space

We introduce SLED, an alternative approach to speech language modeling that encodes speech waveforms into sequences of continuous representations and models them autoregressively in the latent space using an energy distance objective.
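For reference, the energy distance between two distributions P and Q is commonly defined as D(P, Q) = 2 E||x - y|| - E||x - x'|| - E||y - y'||, with x, x' drawn from P and y, y' drawn from Q. The sketch below is a minimal sample-based estimate of that quantity in plain Python. The function names are illustrative, and this is not SLED's training loss, whose exact form is not given on this page.

```python
import math


def mean_pairwise_distance(xs, ys):
    """Average Euclidean distance over all pairs (x, y) with x in xs, y in ys."""
    total = 0.0
    for x in xs:
        for y in ys:
            total += math.dist(x, y)
    return total / (len(xs) * len(ys))


def energy_distance(xs, ys):
    """Plug-in estimate of 2 E||x - y|| - E||x - x'|| - E||y - y'||.

    xs and ys are lists of equal-dimension points (tuples of floats).
    Self-pairs are included in the within-set terms, so this is the
    energy distance between the two empirical distributions.
    """
    cross = mean_pairwise_distance(xs, ys)
    within_x = mean_pairwise_distance(xs, xs)
    within_y = mean_pairwise_distance(ys, ys)
    return 2.0 * cross - within_x - within_y


# Hypothetical toy data: two small point clouds in 2-D.
samples_p = [(0.0, 0.0), (1.0, 0.0)]
samples_q = [(5.0, 0.0), (6.0, 0.0)]
print(energy_distance(samples_p, samples_q))
```

The estimate is zero when both sample sets are identical and grows as the two point clouds move apart, which is what makes the quantity usable as a distance between a model's samples and data.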

Audio Samples

3s Prefix as Prompt

Text: The dews were suffered to exhale, and the sun had dispersed the mists, and was shedding a strong and clear light in the forest, when the travelers resumed their journey.

Prompt Speech:

Synthesized Speech:

Text: Unc knocked at the door of the house and a chubby, pleasant faced woman, dressed all in blue, opened it and greeted the visitors with a smile.

Prompt Speech:

Synthesized Speech:

Text: Now Delia contrived to obtain a great influence and ascendency over the minds of the children by means of these dolls.

Prompt Speech:

Synthesized Speech:

Text: And often has my mother said, While on her lap I laid my head, She feared for time I was not made, But for Eternity.

Prompt Speech:

Synthesized Speech:

Text: He had preconceived ideas about everything, and his idea about Americans was that they should be engineers or mechanics.

Prompt Speech:

Synthesized Speech:

Reference Utterance as Prompt

Text: I greatly mourn that one so well disposed should die in his ignorance, and I have sought a goodly hymn Can you lead me to him?

Prompt Speech:

Synthesized Speech:

Text: In a few hours the examination would commence, and he was still in the dilemma between making the facts public and allowing the culprit to compete for the valuable scholarship.

Prompt Speech:

Synthesized Speech:

Text: Yes, something everything,' said Rachel, hurriedly, looking frowningly at a flower which she was twirling in her fingers.

Prompt Speech:

Synthesized Speech:

Text: This has indeed been a harassing day, continued the young man, his eyes fixed upon his friend.

Prompt Speech:

Synthesized Speech:

Text: And it is made of mother's best yarn, and she knitted it herself, and everybody wants to get it away from me.

Prompt Speech:

Synthesized Speech:

Streaming Synthesis

Thanks to its efficient architecture, SLED naturally supports streaming synthesis by setting an interleaving ratio between text and speech positions.

In the examples below, the interleaving ratio is set to 5:45, generating a 0.6-second audio segment for every 5 text tokens, matching the CosyVoice2 streaming setup.
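As a rough illustration of the 5:45 ratio, the sketch below builds the order in which text and speech positions would alternate and derives the frame rate implied by the numbers above (45 speech positions per 0.6 seconds). The function name, the segment labels, and the handling of a final partial text chunk are assumptions made for illustration; the page does not describe how SLED treats the end of the text.

```python
def interleave_schedule(num_text_tokens, text_chunk=5, speech_chunk=45):
    """Return a list of ("text", n) and ("speech", n) segments.

    Each block of up to `text_chunk` text tokens is followed by
    `speech_chunk` speech positions. A final partial text block is
    also followed by a full speech block (an assumption of this sketch).
    """
    schedule = []
    for start in range(0, num_text_tokens, text_chunk):
        n_text = min(text_chunk, num_text_tokens - start)
        schedule.append(("text", n_text))
        schedule.append(("speech", speech_chunk))
    return schedule


SEGMENT_SECONDS = 0.6          # stated above: one audio segment per 5 text tokens
SPEECH_POSITIONS = 45          # stated above: speech side of the 5:45 ratio
implied_frame_rate = SPEECH_POSITIONS / SEGMENT_SECONDS  # speech positions per second

for kind, count in interleave_schedule(12):
    print(kind, count)
print("implied speech positions per second:", round(implied_frame_rate))
```

With this layout, audio can start playing after the first 5 text tokens have been consumed instead of waiting for the full sentence.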

Text: This without reckoning in the pains of the heart. And so it goes on.

Synthesized Speech:

Text: I'll gladly do that, promised the new Boolooroo and I'll feed the honorable goat all the shavings and leather and tin cans he can eat, besides the grass.

Synthesized Speech:

Text: The scout, who had left David at the door, to ascertain they were not observed, thought it prudent to preserve his disguise until assured of their privacy.

Synthesized Speech:

Text: He had a lot of line out, and the place was none too free for a long cast but he was impatient to drop his flies again on the spot where the big fish was feeding.

Synthesized Speech:

Text: Instead of but six regularly affiliated members, and at most two score of adherents, the organization numbers today many hundred thousand souls.

Synthesized Speech: