Recent LLM-based TTS systems achieve strong quality and zero-shot ability, but lack fine-grained emotional control due to their reliance on discrete speech tokens. Existing approaches either limit emotions to categorical labels or cannot generalize to LLM-based architectures. We propose EMORL-TTS (Fine-grained Emotion-controllable TTS with Reinforcement Learning), a framework that unifies global intensity control in the VAD space with local emphasis regulation. Our method combines supervised fine-tuning with reinforcement learning guided by task-specific rewards for emotion category, intensity, and emphasis. We also investigate how emphasis placement modulates fine-grained emotion intensity. Experiments show that EMORL-TTS improves emotion accuracy, intensity differentiation, and emphasis clarity, while preserving synthesis quality comparable to strong LLM-based baselines.
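To make the reinforcement learning stage more concrete, here is a minimal sketch of how three task rewards could be combined and turned into GRPO-style group-relative advantages. It is an illustration under stated assumptions, not the paper's implementation: the function names, the equal reward weights, the example reward values, and the use of the population standard deviation are all hypothetical.

```python
from statistics import mean, pstdev


def combined_reward(r_emotion, r_intensity, r_emphasis, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three task rewards (the weights are assumed, not from the paper)."""
    w_emo, w_int, w_emp = weights
    return w_emo * r_emotion + w_int * r_intensity + w_emp * r_emphasis


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against the other samples drawn for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps avoids division by zero
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical (emotion, intensity, emphasis) rewards for four samples of one prompt.
group = [(1.0, 0.8, 1.0), (1.0, 0.2, 0.0), (0.0, 0.5, 1.0), (1.0, 0.6, 1.0)]
rewards = [combined_reward(*sample) for sample in group]
advantages = group_relative_advantages(rewards)
```

Samples that score above the group mean receive a positive advantage and are reinforced; samples below it are discouraged. No separate value network is needed, because the group itself serves as the baseline.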
We test whether the generated speech conveys the target emotion; EMORL-TTS achieves higher emotion accuracy than all compared models.
| Emotion | CosyVoice2 | EmoSpeech | EMORL w/o GRPO | EMORL |
|---|---|---|---|---|
| Neutral | | | | |
| Angry | | | | |
| Happy | | | | |
| Sad | | | | |
| Surprise | | | | |
We evaluate whether weak, medium, and strong emotion intensities can be clearly distinguished; EMORL-TTS provides finer-grained and more robust intensity control than the other systems.
| Intensity | Relative Attribute | EmoSphere++ | EMORL |
|---|---|---|---|
| Weak | | | |
| Medium | | | |
| Strong | | | |
We check whether emphasis is perceived on the intended words (emphasized words are marked with * on this demo page). EMORL-TTS delivers clearer and more reliable emphasis across emotions than the compared approaches. For the surprise emotion, however, emphasis control is relatively weaker: adding emphasis can cause the speech to be misclassified as another emotion, and the RL stage prioritizes emotion accuracy as the core reward.
| Emotion | CosyVoice2 | EME-TTS | EMORL |
|---|---|---|---|
| Neutral | | | |
| Angry | | | |
| Happy | | | |
| Sad | | | |
| Surprise | | | |
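The `*word*` marking convention used for emphasized words can be parsed with a few lines of code. The sketch below is only an illustration of that convention: the function names, the example sentence, and the hit criterion are hypothetical and are not taken from the paper's evaluation pipeline.

```python
import re


def parse_emphasis(text):
    """Split text on whitespace and return (plain_words, indices_of_emphasized_words)."""
    words, emphasized = [], []
    for token in text.split():
        # A marked token looks like *word*, optionally followed by punctuation.
        match = re.fullmatch(r"\*(.+?)\*([.,!?;:]*)", token)
        if match:
            emphasized.append(len(words))
            words.append(match.group(1) + match.group(2))
        else:
            words.append(token)
    return words, emphasized


def emphasis_hit(intended_indices, perceived_index):
    """True if a listener (or detector) placed the emphasis on an intended word."""
    return perceived_index in intended_indices


# Hypothetical input sentence with one emphasized word.
words, intended = parse_emphasis("I am *extremely* happy today.")
```

Averaging `emphasis_hit` over all test utterances would give one simple measure of how often emphasis lands on the intended word.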
We report MOS and NISQA scores: EMORL-TTS surpasses all baselines in controllability while preserving the high synthesis quality and naturalness characteristic of LLM-based TTS.
Sentences:
1. All smile were real and the happier the more sincere.
2. Monster made a deep bow.
3. They'd never know I'd regular ran away.
| Sentence | CosyVoice2 | EmoSpeech | EmoSphere++ | Spark-TTS | EMORL w/o GRPO | EMORL |
|---|---|---|---|---|---|---|
| 1 | | | | | | |
| 2 | | | | | | |
| 3 | | | | | | |
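MOS values are commonly reported as a mean listener rating with a confidence interval. The sketch below shows one way to compute this, assuming a normal approximation with z = 1.96. The function name and the ratings are made up for illustration and are not results from this work.

```python
from math import sqrt
from statistics import mean, stdev


def mos_with_ci(ratings, z=1.96):
    """Return (mean opinion score, half-width of an approximate 95% confidence interval)."""
    m = mean(ratings)
    if len(ratings) < 2:
        return m, 0.0  # an interval cannot be estimated from a single rating
    half_width = z * stdev(ratings) / sqrt(len(ratings))
    return m, half_width


# Hypothetical 1-to-5 ratings from five listeners for one system.
mos, ci = mos_with_ci([4.0, 5.0, 4.0, 3.0, 4.0])
print(f"MOS = {mos:.2f} ± {ci:.2f}")
```

With only a handful of listeners, a Student's t quantile would be more appropriate than 1.96; the normal approximation is used here only to keep the example short.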
As an extended study within EMORL-TTS, we investigate how emphasis placement affects perceived emotion intensity. In EMORL-generated speech, emphasizing adverbs yields the largest increase in emotion intensity.
| Emotion | No Emphasis | *extremely* (Adverb) | *after* (Other) | *hearing* (Verb) | *good* (Adjective) | *news* (Noun) |
|---|---|---|---|---|---|---|
| Angry | | | | | | |
| Happy | | | | | | |
| Sad | | | | | | |
| Surprise | | | | | | |
We will open-source our code upon paper acceptance.
Hope you enjoy this research! ^_^