Train Synthesizer in Spanish #941
How many different speakers are there? At least 300 is suggested. If you train with a batch size of 12 as you mentioned in #940, the model may not have converged yet after 50k steps; it might need 100k+ steps. Moreover, I assume your model did learn attention, since you get decent quality but the wrong voice? Then the question is: did you finetune your model? To get good results it is crucial to finetune it for a single speaker; this vastly improves quality. Have a look at #437
What I want to achieve is to be able to clone any voice unseen during training, as the English pretrained model does, but in Spanish. That's why I didn't finetune it. Here is the loss at 50k steps:
Training loss varies with datasets, but it doesn't look wrong.
For CommonVoice I used the transcripts that come with it, and generated a .txt for each audio. To work around the problem of training a new encoder for the language, I tried cloning an audio in English, so the encoder handles it well, and then put the text in Spanish, since the synthesizer is being trained in Spanish.
OK, I only know CommonVoice for other languages; there we have multiple .tsv files. Maybe the Spanish CommonVoice dataset is different.
Oh, now I get what you were asking. I used validated.tsv to copy only the validated audios from the original set into a new directory, and I also generated the .txt transcripts from that file.
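The validated.tsv step described above can be sketched roughly like this. This is an illustration, not code from the repo; the helper name `export_validated` is made up, and it assumes the standard CommonVoice TSV columns `path` and `sentence`:

```python
import csv
import shutil
from pathlib import Path

def export_validated(tsv_path, clips_dir, out_dir):
    """Copy each validated clip and write its transcript to a matching .txt.

    Assumes a CommonVoice-style TSV with 'path' and 'sentence' columns
    (standard in Mozilla CommonVoice releases).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with open(tsv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            clip = Path(clips_dir) / row["path"]
            if not clip.exists():
                continue  # skip entries whose audio file is missing
            shutil.copy(clip, out_dir / clip.name)
            transcript = out_dir / (clip.stem + ".txt")
            transcript.write_text(row["sentence"], encoding="utf-8")
```

Each validated clip then sits next to a same-named .txt holding its sentence, which is the pairing the synthesizer preprocessing expects.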
Did you try to use CommonVoice with the code in this repo? What suggestions can you give me about it? I haven't found another dataset with as many speakers yet.
CommonVoice should be the best dataset by far; I did not find such a sheer number of speakers anywhere else.
But I am afraid the problem lies in the encoder, as the cloning quality depends mainly on the encoder. I remember it was stated in some issue, but I could not find it. This comment by blue-fish points in the same direction:
I already tried the pretrained.pt files with the same audio of my voice and it worked, which is why I don't think it's the encoder; if it were, it wouldn't have worked with the pretrained.pt either. As I said before, the target audio is me speaking in English for about 10-11 seconds. So the problem should be in the model I get from the synthesizer. I am pretty sure the issue is that I don't have enough speakers to train on. Now that you said I should use train.tsv, maybe that was the issue with CommonVoice. Did you see my older post here #789 (comment)? All of that was with validated.tsv; I will try right now with train.tsv and see how it works. I will keep trying until I get this Spanish model! Thank you for being alert!
By the way, which batch size do you recommend for an RTX 2060 and for an RTX 3060?
Okay, now I understand your idea. If it works with your target voice in English, then it may be fine.
Thank you Bebaam. I asked about the 3060 because @Andredenise is helping me with his GPU; we will work on it later, and right now I will prepare the data. I hope we'll have good news by tomorrow!
@AlexSteveChungAlvarez To train the synthesizer, did you preprocess the dataset? Did you get the right accents for the Spanish language? Did you specify different characters in symbols and cleaners? Thank you!
Hi @ireneb612! Yes, the code itself has a script for preprocessing. With the community's help I found that Mozilla's CommonVoice dataset was the best because of its variety of accents for the language. Yes, I specified the characters in symbols and cleaners as mentioned in the various issues (and I think in the guide too). Here is the repo with the resulting code of my work: https://github.com/AlexSteveChungAlvarez/Real-Time-Voice-Cloning-Spanish ; it includes the script to prepare the dataset too!
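The "characters in symbols" change amounts to extending the synthesizer's alphabet so Spanish accented letters and inverted punctuation survive text preprocessing. A minimal sketch of the idea follows; the variable names mirror the repo's `synthesizer/utils/symbols.py`, but treat the exact contents as an illustration rather than the upstream file:

```python
# Sketch of extending the character set for Spanish. Any character not
# in `symbols` would typically be dropped or cause mismatched training
# text, so accented vowels and ¿ ¡ must be listed explicitly.
_pad = "_"
_eos = "~"
_characters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz!'\"(),-.:;? "
_spanish = "áéíóúüñÁÉÍÓÚÜÑ¿¡"  # added for Spanish text

symbols = [_pad, _eos] + list(_characters + _spanish)

def uncovered(text):
    """Return characters in `text` missing from the symbol set."""
    return sorted(set(text) - set(symbols))

print(uncovered("¿Quién habló ayer?"))  # → [] once the Spanish characters are added
```

A quick `uncovered()` pass over your transcripts is a cheap way to catch characters the cleaners would otherwise silently lose.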
Hello, I want to use a Spanish dataset from Argentina. Can this implementation be adapted for that? Any information is welcome. Thanks a lot!
Of course! You just need to put your dataset in the correct structure. |
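"The correct structure" generally means an audio/transcript tree the preprocessing scripts can walk, with each utterance's audio next to a same-named transcript. A small hypothetical checker like this can verify the pairing before launching preprocessing (the layout and extensions are assumptions; adjust them to your dataset):

```python
from pathlib import Path

def check_pairs(root):
    """Return audio files under `root` that lack a matching .txt transcript.

    Assumed layout (LibriSpeech-style, which the repo's preprocessing
    scripts are written around): root/<speaker>/<session>/<utt>.wav
    next to <utt>.txt. Swap the extensions if your dataset differs.
    """
    missing = []
    for wav in Path(root).rglob("*.wav"):
        if not wav.with_suffix(".txt").exists():
            missing.append(wav)
    return missing
```

Running it on the dataset root and fixing anything it reports avoids the preprocessor silently skipping (or crashing on) unpaired files.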
Excellent! One more question: in this issue you mention that the results you obtained do not resemble the target voice. Were you able to solve this problem? Any suggestions? Thanks for your help, AlexSteveChungAlvarez!
I think we'd better discuss this via email, since it's not part of the issue. But yes, in my opinion the results, even from the most recent models, don't sound like the targets. For now, to achieve this you need to finetune the model on a dataset of the target voice, which works when you have many audios of the target to clone.
I trained the synthesizer with this dataset: http://openslr.org/73/ .
The models obtained up to 50k steps are here: https://drive.google.com/drive/folders/1pYc0YK6YfdikMONkR-29054_uMxTgy_g?usp=sharing . However, the results are not even close to the target voice to clone. Any suggestions?
It does sound like a human, but not like the target.
Originally posted by @AlexSteveChungAlvarez in #789 (comment)