Prosody Control #983
-
These papers often require a lot of speech data and are not very useful in cases like VOICEVOX ;-)
-
Awesome, thanks for the paper reference, pretty helpful. So you already know about style tokens; they can do a lot of things besides prosody control. For the current VOICEVOX speakers, for example, they could enable continuous control over styles, like 50% あまあま and 30% つん. There must be a reason you're not using them at the moment, maybe the low-resource problem?
Have you actually tried this out in some experiments? How about data augmentation or pre-training? I also wonder whether this would actually work for low-resource speakers.
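To make the continuous-control idea concrete, here is a rough sketch of what blending learned style embeddings at inference time could look like. The style names and the `style_embeddings` dict are made up for illustration, not an actual VOICEVOX API:

```python
# Rough sketch of continuous style mixing at inference time, assuming the
# model exposes one learned embedding per named style (the names and the
# style_embeddings dict below are hypothetical, not real VOICEVOX APIs).
import numpy as np

style_embeddings = {
    "あまあま": np.random.randn(256).astype(np.float32),  # placeholder vectors
    "つん": np.random.randn(256).astype(np.float32),
    "ノーマル": np.random.randn(256).astype(np.float32),
}

def mix_styles(weights: dict[str, float]) -> np.ndarray:
    """Blend style embeddings, e.g. {"あまあま": 0.5, "つん": 0.3, "ノーマル": 0.2}."""
    total = sum(weights.values())
    # Normalize the weights and take a weighted sum of the style vectors.
    mixed = sum(w / total * style_embeddings[name] for name, w in weights.items())
    return mixed  # feed this to the acoustic model in place of a single style

blended = mix_styles({"あまあま": 0.5, "つん": 0.3, "ノーマル": 0.2})
```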
-
Sad 😥, thanks for sharing!
-
Hey, I'm just here to share something I picked up from browsing the papers: besides phoneme-level control (pitch, duration, energy, etc.), modern industrial TTS pipelines (arXiv:2110.12612) use both utterance-level and word-level prosody style tokens, which derive from the global style token (arXiv:1803.09017).
That system, made by Microsoft, took first prize in the Blizzard Challenge 2021 (actually, almost every system submitted to the challenge uses style tokens). Whatever acoustic model you're using, you should definitely check it out.
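For reference, here is a minimal sketch of the global-style-token idea from arXiv:1803.09017: an embedding of a reference utterance attends over a small bank of learnable tokens to produce a single style vector. Class and variable names are my own, and the reference encoder and dimensions are simplified:

```python
# Minimal GST-style layer in the spirit of arXiv:1803.09017 -- names and
# hyperparameters are illustrative, not taken from any VOICEVOX code.
import torch
import torch.nn as nn


class GlobalStyleTokens(nn.Module):
    def __init__(self, ref_dim=128, token_num=10, token_dim=256, num_heads=4):
        super().__init__()
        # Learnable bank of style tokens (token_num tokens of size token_dim).
        self.tokens = nn.Parameter(torch.randn(token_num, token_dim) * 0.3)
        self.attn = nn.MultiheadAttention(
            embed_dim=token_dim, num_heads=num_heads, batch_first=True,
        )
        self.query_proj = nn.Linear(ref_dim, token_dim)

    def forward(self, ref_embedding):
        # ref_embedding: (batch, ref_dim), e.g. the final GRU state of a
        # reference encoder run over the target mel spectrogram.
        query = self.query_proj(ref_embedding).unsqueeze(1)          # (B, 1, D)
        keys = torch.tanh(self.tokens).unsqueeze(0).expand(
            ref_embedding.size(0), -1, -1)                           # (B, T, D)
        style, weights = self.attn(query, keys, keys)
        return style.squeeze(1), weights.squeeze(1)                  # (B, D), (B, T)


# Usage: broadcast the style embedding onto the text encoder outputs.
gst = GlobalStyleTokens()
ref = torch.randn(2, 128)            # dummy reference embeddings
text_enc = torch.randn(2, 50, 256)   # dummy text encoder outputs
style, w = gst(ref)
conditioned = text_enc + style.unsqueeze(1)
```

The utterance-level and word-level prosody tokens in the Microsoft system work on the same principle, just applied at different granularities.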