In recent years, emotional Text-to-Speech (TTS) synthesis and emphasis-controllable speech synthesis have advanced significantly. However, their interaction remains underexplored. We propose Emphasis Meets Emotion TTS (EME-TTS), a novel framework designed to address two key research questions: (1) how to effectively utilize emphasis to enhance the expressiveness of emotional speech, and (2) how to maintain the perceptual clarity and stability of target emphasis across different emotions. EME-TTS employs weakly supervised learning with emphasis pseudo-labels and variance-based emphasis features. Additionally, the proposed Emphasis Perception Enhancement (EPE) block enhances the interaction between emotional signals and emphasis positions. Experimental results show that EME-TTS, when combined with large language models for emphasis position prediction, enables more natural emotional speech synthesis while preserving stable and distinguishable target emphasis across emotions.
We designed a subjective test that is more challenging than those used in previous studies on emphasis control. Instead of rating the degree of emphasis at a predefined position, participants were asked to identify the emphasized word in each of 80 randomly shuffled speech samples. The results indicate that the emphasis produced by our proposed model was clearly perceivable across different emotions.
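The identification test described above can be scored as a per-emotion accuracy: the fraction of responses in which the listener picked the target word. The following is a minimal sketch under stated assumptions, not the scoring script used in the study. The function name `identification_accuracy`, the `(emotion, target_word, chosen_word)` response format, and the example responses are all hypothetical.

```python
from collections import defaultdict


def identification_accuracy(responses):
    """Return, per emotion, the fraction of responses that hit the target word.

    `responses` is an iterable of (emotion, target_word, chosen_word) tuples.
    This tuple layout is an assumption made for illustration.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for emotion, target, chosen in responses:
        totals[emotion] += 1
        # Compare case-insensitively, ignoring whitespace and trailing punctuation.
        normalized = chosen.strip().strip(".,!?").lower()
        if normalized == target.strip().lower():
            hits[emotion] += 1
    return {emotion: hits[emotion] / totals[emotion] for emotion in totals}


# Made-up listener responses, for illustration only (not results from the study).
example = [
    ("Angry", "now", "now"),
    ("Angry", "now", "choosing"),
    ("Sad", "wear", "Wear."),
    ("Sad", "wear", "wear"),
]
accuracy = identification_accuracy(example)
print(accuracy)
```

Because listeners choose freely among all words in the sentence, chance level depends on sentence length, which is one reason this test is harder than rating a predefined position.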
Emotion | Emphasized Word | EME-TTS w/o EPE | EME-TTS |
---|---|---|---|
Neutral | choosing | ||
Angry | now | ||
Happy | choosing | ||
Sad | wear | ||
Surprise | skirt | ||
Emotion | Emphasized Word | EME-TTS w/o EPE | EME-TTS |
---|---|---|---|
Neutral | mean | ||
Angry | must | ||
Happy | name | ||
Sad | mean | ||
Surprise | something | ||
To ensure fairness, all CosyVoice2-generated samples were conditioned on a neutral reference speaker's audio and a textual emotion prompt, so that only the speaker's identity and the emotion label were provided as input. During inference, our proposed model always used a large language model to predict suitable emphasis positions, which were then supplied as input for testing. The results show that EME-TTS achieves higher emotion accuracy in synthesized speech than the baseline models, highlighting its overall effectiveness in generating emotionally expressive speech.
Emotion | Emphasized Word | CosyVoice2 | EmoSpeech | EME-TTS w/o EPE | EME-TTS |
---|---|---|---|---|---|
Neutral | right | ||||
Angry | right | ||||
Happy | right | ||||
Sad | way | ||||
Surprise | chose | ||||
Emotion | Emphasized Word | CosyVoice2 | EmoSpeech | EME-TTS w/o EPE | EME-TTS |
---|---|---|---|---|---|
Neutral | gave | ||||
Angry | hurrying | ||||
Happy | nudge | ||||
Sad | nudge | ||||
Surprise | hurrying | ||||
Participants evaluated 30 sets of speech samples, each containing outputs from four models, based on perceived emotional expressiveness. Among them, 10 sets were derived from a short passage with contextual information. Within each set, samples were ranked from 1 (least expressive) to 4 (most expressive). The results indicate that EME-TTS consistently received the highest rankings, especially in the contextualized setting. This suggests that surrounding linguistic context strengthens the semantic foundation for emphasis, further enhancing emotional expressiveness.
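The ranking protocol above can be aggregated as a mean rank per model, where a higher mean indicates greater perceived expressiveness (since 4 is the most expressive). The sketch below is illustrative only and is not the analysis code from the study. The function name `mean_ranks`, the dictionary-per-ranking layout, and the two example rankings are assumptions.

```python
from statistics import mean

MODELS = ["CosyVoice2", "EmoSpeech", "EME-TTS w/o EPE", "EME-TTS"]


def mean_ranks(rankings):
    """Average the rank each model received across all rankings.

    `rankings` is a list of dicts mapping model name -> rank, one dict per
    (participant, sample set) pair. Ranks run from 1 (least expressive)
    to 4 (most expressive), and each rank is used exactly once per set.
    """
    for ranking in rankings:
        # Reject incomplete rankings or rankings with ties.
        if set(ranking) != set(MODELS) or sorted(ranking.values()) != [1, 2, 3, 4]:
            raise ValueError(f"invalid ranking: {ranking}")
    return {model: mean(r[model] for r in rankings) for model in MODELS}


# Made-up rankings, for illustration only (not results from the study).
example_rankings = [
    {"CosyVoice2": 2, "EmoSpeech": 1, "EME-TTS w/o EPE": 3, "EME-TTS": 4},
    {"CosyVoice2": 1, "EmoSpeech": 2, "EME-TTS w/o EPE": 4, "EME-TTS": 3},
]
averages = mean_ranks(example_rankings)
print(averages)
```

To compare the contextualized and isolated settings, the same function can be applied separately to the rankings from the 10 passage-derived sets and to the remaining 20 sets.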
Short passage (10 sets):
1. Emma stepped outside and found a small box on her doorstep. (Surprise)
2. She stared at it trying to remember if she had ordered anything. (Neutral)
3. A folded note inside had nothing written. (Neutral)
4. She lifted the lid and gasped as a silver bracelet shimmered in the light. (Surprise)
5. It looked exactly like the one her grandmother used to wear. (Sad)
6. She traced the patterns on it remembering how she had lost it long ago. (Sad)
7. She looked around wondering who could have left it there and why. (Angry)
8. Her heart pounded as frustration bubbled inside her chest. (Angry)
9. Then warmth filled her as she held the bracelet tightly and smiled. (Happy)
10. She placed it on her wrist feeling as if her grandmother was near once more. (Happy)
Sentence | Emphasized Word | CosyVoice2 | EmoSpeech | EME-TTS w/o EPE | EME-TTS |
---|---|---|---|---|---|
Sentence 1 | small | ||||
Sentence 2 | remember | ||||
Sentence 3 | nothing | ||||
Sentence 4 | bracelet | ||||
Sentence 5 | wear | ||||
Sentence 6 | lost | ||||
Sentence 7 | left | ||||
Sentence 8 | pounded | ||||
Sentence 9 | smiled | ||||
Sentence 10 | near | ||||
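Each row above pairs a sentence with one emphasized word. One simple way to represent such an emphasis position is a word-level binary mask, sketched below. This is an assumption made for illustration: the page does not state the input format EME-TTS actually uses, and the function name `emphasis_mask` and the tokenization rule are hypothetical.

```python
import re


def emphasis_mask(sentence, emphasized_word):
    """Split a sentence into words and flag the emphasized one with 1.

    Word-level binary labels are an assumed representation, not the
    documented EME-TTS input format. If the word occurs more than once,
    every occurrence is flagged.
    """
    words = re.findall(r"[A-Za-z']+", sentence)
    target = emphasized_word.lower()
    mask = [1 if word.lower() == target else 0 for word in words]
    if not any(mask):
        raise ValueError(f"{emphasized_word!r} does not occur in the sentence")
    return words, mask


# Sentence 5 of the short passage, with its emphasized word from the table.
words, mask = emphasis_mask(
    "It looked exactly like the one her grandmother used to wear.", "wear"
)
for word, flag in zip(words, mask):
    print(f"{word}\t{flag}")
```

A mask like this makes it straightforward to check that a predicted emphasis position refers to a word that actually appears in the text before synthesis.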