Train Synthetizer in Spanish #941

Closed
AlexSteveChungAlvarez opened this issue Dec 7, 2021 · 20 comments

@AlexSteveChungAlvarez

I trained the synthesizer on this dataset: http://openslr.org/73/ .
The models obtained up to 50k steps are here: https://drive.google.com/drive/folders/1pYc0YK6YfdikMONkR-29054_uMxTgy_g?usp=sharing . However, the results are nowhere near the target voice I am trying to clone. Any suggestions?
The output does sound human, but not like the target.

Originally posted by @AlexSteveChungAlvarez in #789 (comment)

@Bebaam

Bebaam commented Dec 8, 2021

How many different speakers are there? At least 300 are suggested. If you train with a batch size of 12, as you mentioned in #940, the model may not have converged yet after 50k steps; it may need 100k steps or more. Moreover, I assume the model did learn attention, since you get fairly good quality but the wrong voice? Then the question is: did you fine-tune your model? To get good results it is crucial to fine-tune it for a single speaker; this will vastly improve quality. Have a look at #437.

@AlexSteveChungAlvarez
Author

What I want to achieve is to clone any voice that was unseen during training, as the English pretrained model does, but in Spanish. That's why I didn't fine-tune it. Here is the loss at 50k steps:
[image: loss plot at 50k steps]
Unfortunately, there isn't any information about the number of speakers in this dataset.
Before that, I tried samples of the cv-corpus dataset (https://commonvoice.mozilla.org/es/datasets), which has plenty of voices, but the outputs for the target audios were mostly noise, like whispering, or silence. Even when the target audio was one from the training set, the model did not speak the text I passed; it output the same audio at much lower quality. I used only samples of that dataset because training on the entire dataset raised an error, which I attached in #789 (comment) . Should I continue training on the Crowdsourced high-quality Peruvian Spanish speech data set until 100k+ steps?
Or maybe you know how to solve the issue with the cv-corpus so that I can train on it?

@Bebaam

Bebaam commented Dec 9, 2021

Training loss varies between datasets, but yours doesn't look wrong.
As mentioned in #30, you may need to train your own encoder for a new language.
For Common Voice, which .tsv file did you use?
It may be really important to give each speaker their own folder to distinguish voices, as we discussed in #934.
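
The per-speaker folder idea can be sketched as follows. This is a minimal sketch, not code from this repository: it assumes the Common Voice .tsv columns `client_id`, `path` and `sentence`, and the function name and the `speaker-00001` folder naming are hypothetical. It only builds a plan of target paths; copying the audio and writing the .txt files is left out.

```python
import csv
from pathlib import Path


def plan_speaker_layout(tsv_lines, out_root):
    """Map each clip in a Common Voice .tsv to a per-speaker target path.

    tsv_lines is any iterable of text lines (for example an open file).
    Returns a list of (clip_name, audio_target, text_target, sentence) tuples.
    """
    reader = csv.DictReader(tsv_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    speaker_dirs = {}  # client_id -> short folder name (hypothetical naming)
    plan = []
    for row in reader:
        client = row["client_id"]
        if client not in speaker_dirs:
            speaker_dirs[client] = f"speaker-{len(speaker_dirs) + 1:05d}"
        clip = Path(row["path"])
        target_dir = Path(out_root) / speaker_dirs[client]
        plan.append((clip.name,
                     target_dir / clip.name,
                     target_dir / f"{clip.stem}.txt",
                     row["sentence"]))
    return plan
```

With a real file, the lines could come from `open("train.tsv", encoding="utf-8", newline="")`; each tuple then says where one clip and its transcript would go.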

@AlexSteveChungAlvarez
Author

For Common Voice I used the file that comes with it, and generated a .txt for each audio. To get around training a new encoder for the language, I tried cloning an audio in English, so that the encoder handles it well, and then passing the text in Spanish, since the synthesizer is being trained on Spanish.

@Bebaam

Bebaam commented Dec 9, 2021

OK, I only know Common Voice for other languages; there we have multiple .tsv files. Maybe the Spanish Common Voice dataset is different.

@AlexSteveChungAlvarez
Author

Oh, now I get what you were asking. I used validated.tsv to copy only the validated audios from the original dataset into a new directory, and I also generated the .txt files from this file.

@AlexSteveChungAlvarez
Author

Did you try to use Common Voice with the code in this repo? What suggestions can you give me about it? I haven't yet found another dataset with as many speakers as it has.

@Bebaam

Bebaam commented Dec 9, 2021

Common Voice should be the best dataset by far; I did not find such a sheer number of speakers anywhere else.
For me, the quality of validated.tsv was not good enough. I assume it lists all speakers whose texts are more or less verified. In contrast, train.tsv fits better; maybe it is the subset of validated with comparatively good quality.
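
One way to compare the .tsv files against the "at least 300 speakers" suggestion is to count distinct speakers in each. The sketch below assumes that the Common Voice `client_id` column identifies a speaker; the function name is hypothetical and not part of this repository.

```python
import csv
from collections import Counter


def speaker_stats(tsv_lines):
    """Count speakers and clips in a Common Voice .tsv.

    tsv_lines is any iterable of text lines (for example an open file).
    """
    reader = csv.DictReader(tsv_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    clips_per_speaker = Counter(row["client_id"] for row in reader)
    return {
        "speakers": len(clips_per_speaker),
        "clips": sum(clips_per_speaker.values()),
        "max_clips_per_speaker": max(clips_per_speaker.values(), default=0),
    }

# Example use (file names as in the Common Voice download):
# with open("train.tsv", encoding="utf-8", newline="") as f:
#     print(speaker_stats(f))
```

Running it on both train.tsv and validated.tsv shows how many speakers each subset would actually contribute.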

@Bebaam

Bebaam commented Dec 9, 2021

But I am afraid that the problem lies in the encoder, as the cloning quality depends mainly on the encoder. I remember this being stated in some issue, but I could not find it. This comment by blue-fish points in that direction:
#162 (comment)

  1. quality may differ from person to person
  2. if the encoder isn't familiar with voices like yours, it can't encode them accordingly.

So if you are really interested in very good quality, I would think about training an encoder. But keep in mind that this will take much more time than training a synthesizer.

@AlexSteveChungAlvarez
Author

I already tried the pretrained.pt files with the same audio of my voice and it worked, so I don't think the encoder is the problem; if it were, the pretrained.pt models would not have worked either. As I said before, the target audio is me speaking in English for about 10-11 seconds. So the problem should be in the model I get from training the synthesizer. I am pretty sure the cause is that I don't have enough speakers to train on. Now that you say I should use train.tsv, maybe that was the issue with Common Voice. Did you see my older post here, #789 (comment)? All of that was with validated.tsv. I will try train.tsv right now and see how it works. I will keep going by trial and error until I get this Spanish model! Thank you for being so responsive!

@AlexSteveChungAlvarez
Author

By the way, which batch size do you recommend for an RTX 2060 and for an RTX 3060?

@Bebaam

Bebaam commented Dec 9, 2021

Okay, now I understand your idea. If it works with your target voice in English, then the encoder may be fine.
I see your older post. I would try with train.tsv, and if the error still occurs, I would search for NaN values in the data. Maybe a few files are corrupt.
The batch_size depends on your GPU VRAM. The more the better in my opinion, so with your 2060 (6 GB, I assume) just try how high you can set the batch_size without getting CUDA out-of-memory errors. For the 3060 with 12 GB, you can easily double that amount.
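
The "try how high you can go" advice can be sketched as a small probe. This is only a sketch under assumptions: `try_batch` is a hypothetical callable that runs one training step at the given batch size, and the out-of-memory failure is assumed to surface as a `RuntimeError` whose message contains "out of memory", which is how PyTorch usually reports it.

```python
def largest_working_batch_size(try_batch, start=4, limit=256):
    """Double the batch size until try_batch fails with an out-of-memory error.

    Returns the largest size that worked, or None if even `start` failed.
    """
    best = None
    size = start
    while size <= limit:
        try:
            try_batch(size)  # hypothetical: one forward/backward pass
        except RuntimeError as err:
            if "out of memory" not in str(err).lower():
                raise  # some other problem; do not hide it
            break
        best = size
        size *= 2
    return best
```

It is probably wise to train slightly below the value found this way, since memory use can vary between batches with utterances of different lengths.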

@AlexSteveChungAlvarez
Author

Thank you Bebaam. I asked about the 3060 because @Andredenise is helping me with his GPU. We will work on it later; right now I will prepare the data. I hope we have good news by tomorrow!
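
While preparing the data, the NaN check suggested earlier in the thread could look roughly like this. It is a minimal standard-library sketch: the function name is hypothetical, and it assumes the features for each file (for example a flattened mel spectrogram) are already available as a sequence of floats.

```python
import math


def non_finite_items(named_values):
    """Return the names whose value sequences contain NaN or infinity.

    named_values is an iterable of (name, sequence_of_floats) pairs.
    """
    bad = []
    for name, values in named_values:
        if any(not math.isfinite(v) for v in values):
            bad.append(name)
    return bad
```

Files reported by such a check can be excluded from the training list before starting a long run.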

@AlexSteveChungAlvarez
Author

AlexSteveChungAlvarez commented Dec 26, 2021

Hello! I want to ask a few things about attention. My synthesizer model is already above 225k steps, but the attention plots seem worse than earlier ones. For example:
step 210500
[image: attention_step_210500_sample_1]
step 229000
[image: attention_step_229000_sample_1]
As in these two examples, many of the plots sometimes look more like the one at step 210500 and other times more like the one at step 229000. I am worried that the model may be overfitting. I also want to know whether the attention plot and the mel-spectrogram are the only things I can compare against Corentin's model, or whether there is another way to compare the two synthesizer models. I also don't know when I should stop training the synthesizer.

@ireneb612

@AlexSteveChungAlvarez To train the synthesizer, did you preprocess the dataset? Did you get the right accents for the Spanish language? Did you specify different characters in symbols and cleaners? Thank you!

@AlexSteveChungAlvarez
Author

AlexSteveChungAlvarez commented Feb 24, 2022

Hi @ireneb612! Yes, the code itself has a script for preprocessing. I found out with the help of the community that Mozilla's Common Voice dataset was the best because of its variety of accents for the language. Yes, I specified the characters in symbols and cleaners as mentioned in the different issues (and I think in the guide too). Here is the repo with the resulting code of my work: https://github.com/AlexSteveChungAlvarez/Real-Time-Voice-Cloning-Spanish . It includes the script to prepare the dataset too!
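
To illustrate what "specifying the characters in symbols and cleaners" can mean for Spanish, here is a minimal sketch. It is not the code from the linked repo: the character set, the names `SYMBOLS`, `spanish_cleaner` and `unknown_chars` are all hypothetical. The idea is that the cleaner keeps accented letters instead of transliterating them to ASCII, and that every transcript is checked against the symbol set.

```python
import re

# Hypothetical Spanish symbol set; a real project keeps its own list.
SPANISH_LETTERS = "abcdefghijklmnñopqrstuvwxyzáéíóúü"
PUNCTUATION = "¡!¿?,.;:'\"()- "
SYMBOLS = set(SPANISH_LETTERS + SPANISH_LETTERS.upper() + PUNCTUATION)

_WHITESPACE = re.compile(r"\s+")


def spanish_cleaner(text):
    """Lowercase and collapse whitespace, keeping accented characters."""
    return _WHITESPACE.sub(" ", text).strip().lower()


def unknown_chars(text):
    """Return the characters of text that are missing from SYMBOLS."""
    return {ch for ch in text if ch not in SYMBOLS}
```

Running `unknown_chars` over all transcripts before training shows which characters would otherwise be dropped or mapped incorrectly.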

@pauortegariera

Hello, I want to use a dataset of Spanish from Argentina. Can this implementation be adapted for that? Any information is welcome. Thanks a lot!

@AlexSteveChungAlvarez
Author

Of course! You just need to put your dataset in the correct structure.

@pauortegariera

Excellent! One more question: in this issue you comment that the results you obtained do not resemble the target voice. Could you solve this problem? Any suggestions? Thanks for your help, AlexSteveChungAlvarez!

@AlexSteveChungAlvarez
Author

I think we had better discuss this via email, since it's not part of the issue. But yes, in my opinion the results, even from the most recent models, don't sound like the targets. To achieve that, for now, you need to fine-tune the model on a dataset of the target voice, which works when you have many audios of the target to clone.
